{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e0cd69ee",
   "metadata": {},
   "source": [
    "# Hybrid search with Qdrant\n",
    "\n",
    "Vector search based on dense embeddings captures the semantics of the data, so you don't have to use the same terms in queries and documents to still be able to find the relevant items. However, historically we were also using some other methods which rely on the presence of the keywords. Methods such as Bag-of-Words, TFIDF and BM25 are still useful and in some cases should be preferred over the dense embeddings. \n",
    "\n",
    "## Sparse vectors\n",
    "\n",
    "Surprisingly, keyword-based search is also implemented as vector search, but these vectors are usually sparse. That means the majority of the dimensions of such a vector are just zeros. A non-zero value at a particular vector dimension indicates the presence of a term from the dictionary assigned to that position. In other words, in sparse vectors, we have a dictionary in which each word/phrase gets its unique position. Since vectors are sparse, the dictionary can theoretically grow indefinitely, as we can append a new term at the very end. \n",
    "\n",
    "The fact of using a flexible dictionary, make the sparse vectors excel in exact matches, as they can cover texts that would be sets of random characters for the dense vectors - such as proper names or identifiers. Dense embedding models also have a dictionary, but once the model is trained, extending them is not that easy, and requires fine-tuning of the model. A typical user rarely goes that far.\n",
    "\n",
    "### BM25\n",
    "\n",
    "There are plenty of different options for creating sparse embeddings, but BM25 is an industry standard, and its most popular form comes from the 90s. It's a statistical model (no neural networks involved), which makes it really fast and lightweight. It's usually a solid baseline in search benchmarks so you should not ignore it.\n",
    "\n",
    "BM25 stands for Best Matching 25, and it was just the 25th attempt to create a formula that calculates how relevant a particular document is, given a query. If you are interested in mathematical background, please check out the [Wikipedia page](https://en.wikipedia.org/wiki/Okapi_BM25) that describes it really well. In general, BM25 is a ranking function that helps search engines determine how relevant a document is to a query by combining two key concepts: **Term Frequency (TF)** and **Inverse Document Frequency (IDF)**. \n",
    "\n",
    "1. The Term Frequency component rewards documents that contain the query terms multiple times, but with diminishing returns - so a document with 10 occurrences of a word isn't necessarily 10 times better than one with just 1 occurrence. \n",
    "2. The Inverse Document Frequency part boosts the importance of rare words while reducing the weight of common words that appear in many documents, since rare terms are typically more informative for distinguishing relevant results. \n",
    "\n",
    "BM25 also incorporates document length normalization to prevent longer documents from having an unfair advantage simply due to their size.\n",
    "\n",
    "In our case, we'll use an implementation available in FastEmbed. Let's start with the basics."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17127ee0",
   "metadata": {},
   "source": [
    "## Step 0: Setup\n",
    "\n",
    "Please refer to [sematic_search.ipynb](sematic_search.ipynb) notebook to set up the libraries required to interact with Qdrant and to create the embeddings. Similarly, please start Qdrant in a Docker container as described there, if it's not running on your machine yet. \n",
    "\n",
    "If you skipped our previous lessons, the following commands will install all the packages and run Qdrant in a background container."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d51cf88",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m pip install -q \"qdrant-client[fastembed]>=1.14.2\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79f9cabd",
   "metadata": {},
   "outputs": [],
   "source": [
    "!docker run -d -p 6333:6333 -p 6334:6334 \\\n",
    "   -v \"./qdrant_storage:/qdrant/storage:z\" \\\n",
    "   qdrant/qdrant"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53b32ff5",
   "metadata": {},
   "source": [
    "## Step 1: Connect to Qdrant\n",
    "\n",
    "Let's connect to Qdrant and test if that was successful by listing all the collections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "bf62ebf9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CollectionsResponse(collections=[])"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from qdrant_client import QdrantClient\n",
    "\n",
    "client = QdrantClient(\"http://localhost:6333\")\n",
    "client.get_collections()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f525ae48",
   "metadata": {},
   "source": [
    "## Step 2: Sparse vector search with BM25\n",
    "\n",
    "We are going to use the same dataset as before. Let's download it and load into Qdrant, but this time we are going to create sparse vectors with BM25 only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "eeac8813",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'\n",
    "docs_response = requests.get(docs_url)\n",
    "documents_raw = docs_response.json()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "3c70e485",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'course': 'data-engineering-zoomcamp',\n",
       " 'documents': [{'text': \"The purpose of this document is to capture frequently asked technical questions\\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\\nSubscribe to course public Google Calendar (it works from Desktop only).\\nRegister before the course starts using this link.\\nJoin the course Telegram channel with announcements.\\nDon’t forget to register in DataTalks.Club's Slack and join the channel.\",\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Course - When will the course start?'},\n",
       "  {'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Course - What are the prerequisites for this course?'},\n",
       "  {'text': \"Yes, even if you don't register, you're still eligible to submit the homeworks.\\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.\",\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Course - Can I still join the course after the start date?'},\n",
       "  {'text': \"You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.\",\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?'},\n",
       "  {'text': 'You can start by installing and setting up all the dependencies and requirements:\\nGoogle cloud account\\nGoogle Cloud SDK\\nPython 3 (installed with Anaconda)\\nTerraform\\nGit\\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Course - What can I do before the course starts?'},\n",
       "  {'text': \"There are 3 Zoom Camps in a year, as of 2024. However, they are for separate courses:\\nData-Engineering (Jan - Apr)\\nMLOps (May - Aug)\\nMachine Learning (Sep - Jan)\\nThere's only one Data-Engineering Zoomcamp “live” cohort per year, for the certification. Same as for the other Zoomcamps.\\nThey follow pretty much the same schedule for each cohort per zoomcamp. For Data-Engineering it is (generally) from Jan-Apr of the year. If you’re not interested in the Certificate, you can take any zoom camps at any time, at your own pace, out of sync with any “live” cohort.\",\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Course - how many Zoomcamps in a year?'},\n",
       "  {'text': 'Yes. For the 2024 edition we are using Mage AI instead of Prefect and re-recorded the terraform videos, For 2023, we used Prefect instead of Airflow..',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Course - Is the current cohort going to be different from the previous cohort?'},\n",
       "  {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Course - Can I follow the course after it finishes?'},\n",
       "  {'text': 'Yes, the slack channel remains open and you can ask questions there. But always sDocker containers exit code w search the channel first and second, check the FAQ (this document), most likely all your questions are already answered here.\\nYou can also tag the bot @ZoomcampQABot to help you conduct the search, but don’t rely on its answers 100%, it is pretty good though.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Course - Can I get support if I take the course in the self-paced mode?'},\n",
       "  {'text': 'All the main videos are stored in the Main “DATA ENGINEERING” playlist (no year specified). The Github repository has also been updated to show each video with a thumbnail, that would bring you directly to the same playlist below.\\nBelow is the MAIN PLAYLIST’. And then you refer to the year specific playlist for additional videos for that year like for office hours videos etc. Also find this playlist pinned to the slack channel.\\nh\\nttps://youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&si=NspQhtZhZQs1B9F-',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Course - Which playlist on YouTube should I refer to?'},\n",
       "  {'text': 'It depends on your background and previous experience with modules. It is expected to require about 5 - 15 hours per week. [source1] [source2]\\nYou can also calculate it yourself using this data and then update this answer.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Course - \\u200b\\u200bHow many hours per week am I expected to spend on this  course?'},\n",
       "  {'text': \"No, you can only get a certificate if you finish the course with a “live” cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.\",\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Certificate - Can I follow the course in a self-paced mode and get a certificate?'},\n",
       "  {'text': 'The zoom link is only published to instructors/presenters/TAs.\\nStudents participate via Youtube Live and submit questions to Slido (link would be pinned in the chat when Alexey goes Live). The video URL should be posted in the announcements channel on Telegram & Slack before it begins. Also, you will see it live on the DataTalksClub YouTube Channel.\\nDon’t post your questions in chat as it would be off-screen before the instructors/moderators have a chance to answer it if the room is very active.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Office Hours - What is the video/zoom link to the stream for the “Office Hour” or workshop sessions?'},\n",
       "  {'text': 'Yes! Every “Office Hours” will be recorded and available a few minutes after the live session is over; so you can view (or rewatch) whenever you want.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Office Hours - I can’t attend the “Office hours” / workshop, will it be recorded?'},\n",
       "  {'text': 'You can find the latest and up-to-date deadlines here: https://docs.google.com/spreadsheets/d/e/2PACX-1vQACMLuutV5rvXg5qICuJGL-yZqIV0FBD84CxPdC5eZHf8TfzB-CJT_3Mo7U7oGVTXmSihPgQxuuoku/pubhtml\\nAlso, take note of Announcements from @Au-Tomator for any extensions or other news. Or, the form may also show the updated deadline, if Instructor(s) has updated it.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Homework - What are homework and project deadlines?'},\n",
       "  {'text': 'No, late submissions are not allowed. But if the form is still not closed and it’s after the due date, you can still submit the homework. confirm your submission by the date-timestamp on the Course page.y\\nOlder news:[source1] [source2]',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Homework - Are late submissions of homework allowed?'},\n",
       "  {'text': 'Answer: In short, it’s your repository on github, gitlab, bitbucket, etc\\nIn long, your repository or any other location you have your code where a reasonable person would look at it and think yes, you went through the week and exercises.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Homework - What is the homework URL in the homework link?'},\n",
       "  {'text': 'After you submit your homework it will be graded based on the amount of questions in a particular homework. You can see how many points you have right on the page of the homework up top. Additionally in the leaderboard you will find the sum of all points you’ve earned - points for Homeworks, FAQs and Learning in Public. If homework is clear, others work as follows: if you submit something to FAQ, you get one point, for each learning in a public link you get one point.\\n(https://datatalks-club.slack.com/archives/C01FABYF2RG/p1706846846359379?thread_ts=1706825019.546229&cid=C01FABYF2RG)',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Homework and Leaderboard - what is the system for points in the course management platform?'},\n",
       "  {'text': 'When you set up your account you are automatically assigned a random name such as “Lucid Elbakyan” for example. If you want to see what your Display name is.\\nGo to the Homework submission link →  https://courses.datatalks.club/de-zoomcamp-2024/homework/hw2 - Log in > Click on ‘Data Engineering Zoom Camp 2024’ > click on ‘Edit Course Profile’ - your display name is here, you can also change it should you wish:',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Leaderboard - I am not on the leaderboard / how do I know which one I am on the leaderboard?'},\n",
       "  {'text': 'Yes, for simplicity (of troubleshooting against the recorded videos) and stability. [source]\\nBut Python 3.10 and 3.11 should work fine.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Environment - Is Python 3.9 still the recommended version to use in 2024?'},\n",
       "  {'text': 'You can set it up on your laptop or PC if you prefer to work locally from your laptop or PC.\\nYou might face some challenges, especially for Windows users. If you face cnd2\\nIf you prefer to work on the local machine, you may start with the week 1 Introduction to Docker and follow through.\\nHowever, if you prefer to set up a virtual machine, you may start with these first:\\nUsing GitHub Codespaces\\nSetting up the environment on a cloudV Mcodespace\\nI decided to work on a virtual machine because I have different laptops & PCs for my home & office, so I can work on this boot camp virtually anywhere.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Environment - Should I use my local machine, GCP, or GitHub Codespaces for my environment?'},\n",
       "  {'text': 'GitHub Codespaces offers you computing Linux resources with many pre-installed tools (Docker, Docker Compose, Python).\\nYou can also open any GitHub repository in a GitHub Codespace.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Environment - Is GitHub codespaces an alternative to using cli/git bash to ingest the data and create a docker file?'},\n",
       "  {'text': \"It's up to you which platform and environment you use for the course.\\nGithub codespaces or GCP VM are just possible options, but you can do the entire course from your laptop.\",\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Environment - Do we really have to use GitHub codespaces? I already have PostgreSQL & Docker installed.'},\n",
       "  {'text': 'Choose the approach that aligns the most with your idea for the end project\\nOne of those should suffice. However, BigQuery, which is part of GCP, will be used, so learning that is probably a better option. Or you can set up a local environment for most of this course.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Environment - Do I need both GitHub Codespaces and GCP?'},\n",
       "  {'text': '1. To open Run command window, you can either:\\n(1-1) Use the shortcut keys: \\'Windows + R\\', or\\n(1-2) Right Click \"Start\", and click \"Run\" to open.\\n2. Registry Values Located in Registry Editor, to open it: Type \\'regedit\\' in the Run command window, and then press Enter.\\' 3. Now you can change the registry values \"Autorun\" in \"HKEY_CURRENT_USER\\\\Software\\\\Microsoft\\\\Command Processor\" from \"if exists\" to a blank.\\nAlternatively, You can simplify the solution by deleting the fingerprint saved within the known_hosts file. In Windows, this file is placed at  C:\\\\Users\\\\<your_user_name>\\\\.ssh\\\\known_host',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'This happens when attempting to connect to a GCP VM using VSCode on a Windows machine. Changing registry value in registry editor'},\n",
       "  {'text': 'For uniformity at least, but you’re not restricted to GCP, you can use other cloud platforms like AWS if you’re comfortable with other cloud platforms, since you get every service that’s been provided by GCP in Azure and AWS or others..\\nBecause everyone has a google account, GCP has a free trial period and gives $300 in credits  to new users. Also, we are working with BigQuery, which is a part of GCP.\\nNote that to sign up for a free GCP account, you must have a valid credit card.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Environment - Why are we using GCP and not other cloud providers?'},\n",
       "  {'text': 'No, if you use GCP and take advantage of their free trial.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Should I pay for cloud services?'},\n",
       "  {'text': 'You can do most of the course without a cloud. Almost everything we use (excluding BigQuery) can be run locally. We won’t be able to provide guidelines for some things, but most of the materials are runnable without GCP.\\nFor everything in the course, there’s a local alternative. You could even do the whole course locally.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Environment - The GCP and other cloud providers are unavailable in some countries. Is it possible to provide a guide to installing a home lab?'},\n",
       "  {'text': 'Yes, you can. Just remember to adapt all the information on the videos to AWS. Besides, the final capstone will be evaluated based on the task: Create a data pipeline! Develop a visualisation!\\nThe problem would be when you need help. You’d need to rely on  fellow coursemates who also use AWS (or have experience using it before), which might be in smaller numbers than those learning the course with GCP.\\nAlso see Is it possible to use x tool instead of the one tool you use?',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Environment - I want to use AWS. May I do that?'},\n",
       "  {'text': 'We will probably have some calls during the Capstone period to clear some questions but it will be announced in advance if that happens.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Besides the “Office Hour” which are the live zoom calls?'},\n",
       "  {'text': 'We will use the same data, as the project will essentially remain the same as last year’s. The data is available here',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Are we still using the NYC Trip data for January 2021? Or are we using the 2022 data?'},\n",
       "  {'text': 'No, but we moved the 2022 stuff here',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Is the 2022 repo deleted?'},\n",
       "  {'text': 'Yes, you can use any tool you want for your project.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Can I use Airflow instead for my final project?'},\n",
       "  {'text': 'Yes, this applies if you want to use Airflow or Prefect instead of Mage, AWS or Snowflake instead of GCP products or Tableau instead of Metabase or Google data studio.\\nThe course covers 2 alternative data stacks, one using GCP and one using local installation of everything. You can use one of them or use your tool of choice.\\nShould you consider it instead of the one tool you use? That we can’t support you if you choose to use a different stack, also you would need to explain the different choices of tool for the peer review of your capstone project.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Is it possible to use tool “X” instead of the one tool you use in the course?'},\n",
       "  {'text': 'Star the repo! Share it with friends if you find it useful ❣️\\nCreate a PR if you see you can improve the text or the structure of the repository.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'How can we contribute to the course?'},\n",
       "  {'text': 'Yes! Linux is ideal but technically it should not matter. Students last year used all 3 OSes successfully',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Environment - Is the course [Windows/mac/Linux/...] friendly?'},\n",
       "  {'text': \"Have no idea how past cohorts got past this as I haven't read old slack messages, and no FAQ entries that I can find.\\nLater modules (module-05 & RisingWave workshop) use shell scripts in *.sh files and most Windows users not using WSL would hit a wall and cannot continue, even in git bash or MINGW64. This is why WSL environment setup is recommended from the start.\",\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Environment - Roadblock for Windows users in modules with *.sh (shell scripts).'},\n",
       "  {'text': 'Yes to both! check out this document: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/awesome-data-engineering.md',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Any books or additional resources you recommend?'},\n",
       "  {'text': 'You will have two attempts for a project. If the first project deadline is over and you’re late or you submit the project and fail the first attempt, you have another chance to submit the project with the second attempt.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Project - What is Project Attemp #1 and Project Attempt #2 exactly?'},\n",
       "  {'text': \"The first step is to try to solve the issue on your own. Get used to solving problems and reading documentation. This will be a real life skill you need when employed. [ctrl+f] is your friend, use it! It is a universal shortcut and works in all apps/browsers.\\nWhat does the error say? There will often be a description of the error or instructions on what is needed or even how to fix it. I have even seen a link to the solution. Does it reference a specific line of your code?\\nRestart app or server/pc.\\nGoogle it, use ChatGPT, Bing AI etc.\\nIt is going to be rare that you are the first to have the problem, someone out there has posted the fly issue and likely the solution.\\nSearch using: <technology> <problem statement>. Example: pgcli error column c.relhasoids does not exist.\\nThere are often different solutions for the same problem due to variation in environments.\\nCheck the tech’s documentation. Use its search if available or use the browsers search function.\\nTry uninstall (this may remove the bad actor) and reinstall of application or reimplementation of action. Remember to restart the server/pc for reinstalls.\\nSometimes reinstalling fails to resolve the issue but works if you uninstall first.\\nPost your question to Stackoverflow. Read the Stackoverflow guide on posting good questions.\\nhttps://stackoverflow.com/help/how-to-ask\\nThis will be your real life. Ask an expert in the future (in addition to coworkers).\\nAsk in Slack\\nBefore asking a question,\\nCheck Pins (where the shortcut to the repo and this FAQ is located)\\nUse the slack app’s search function\\nUse the bot @ZoomcampQABot to do the search for you\\ncheck the FAQ (this document), use search [ctrl+f]\\nWhen asking a question, include as much information as possible:\\nWhat are you coding on? What OS?\\nWhat command did you run, which video did you follow? Etc etc\\nWhat error did you get? 
Does it have a line number to the “offending” code and have you check it for typos?\\nWhat have you tried that did not work? This answer is crucial as without it, helpers would ask you to do the suggestions in the error log first. Or just read this FAQ document.\\nDO NOT use screenshots, especially don’t take pictures from a phone.\\nDO NOT tag instructors, it may discourage others from helping you. Copy and paste errors; if it’s long, just post it in a reply to your thread.\\nUse ``` for formatting your code.\\nUse the same thread for the conversation (that means reply to your own thread).\\nDO NOT create multiple posts to discuss the issue.\\nlearYou may create a new post if the issue reemerges down the road. Describe what has changed in the environment.\\nProvide additional information in the same thread of the steps you have taken for resolution.\\nTake a break and come back later. You will be amazed at how often you figure out the solution after letting your brain rest. Get some fresh air, workout, play a video game, watch a tv show, whatever allows your brain to not think about it for a little while or even until the next day.\\nRemember technology issues in real life sometimes take days or even weeks to resolve.\\nIf somebody helped you with your problem and it's not in the FAQ, please add it there. It will help other students.\",\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'How to troubleshoot issues'},\n",
       "  {'text': 'When the troubleshooting guide above does not help resolve it and you need another pair of eyeballs to spot mistakes. When asking a question, include as much information as possible:\\nWhat are you coding on? What OS?\\nWhat command did you run, which video did you follow? Etc etc\\nWhat error did you get? Does it have a line number to the “offending” code and have you check it for typos?\\nWhat have you tried that did not work? This answer is crucial as without it, helpers would ask you to do the suggestions in the error log first. Or just read this FAQ document.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'How to ask questions'},\n",
       "  {'text': 'After you create a GitHub account, you should clone the course repo to your local machine using the process outlined in this video: Git for Everybody: How to Clone a Repository from GitHub\\nHaving this local repository on your computer will make it easy for you to access the instructors’ code and make pull requests (if you want to add your own notes or make changes to the course content).\\nYou will probably also create your own repositories that host your notes, versions of your file, to do this. Here is a great tutorial that shows you how to do this: https://www.atlassian.com/git/tutorials/setting-up-a-repository\\nRemember to ignore large database, .csv, and .gz files, and other files that should not be saved to a repository. Use .gitignore for this: https://www.atlassian.com/git/tutorials/saving-changes/gitignore NEVER store passwords or keys in a git repo (even if that repo is set to private).\\nThis is also a great resource: https://dangitgit.com/',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'How do I use Git / GitHub for this course?'},\n",
       "  {'text': 'Error: Makefile:2: *** missing separator.  Stop.\\nSolution: Tabs in document should be converted to Tab instead of spaces. Follow this stack.',\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'VS Code: Tab using spaces'},\n",
       "  {'text': \"If you’re running Linux on Windows Subsystem for Linux (WSL) 2, you can open HTML files from the guest (Linux) with whatever Internet Browser you have installed on the host (Windows). Just install wslu and open the page with wslview <file>, for example:\\nwslview index.html\\nYou can customise which browser to use by setting the BROWSER environment variable first. For example:\\nexport BROWSER='/mnt/c/Program Files/Firefox/firefox.exe'\",\n",
       "   'section': 'General course-related questions',\n",
       "   'question': 'Opening an HTML file with a Windows browser from Linux running on WSL'},\n",
       "  {'text': 'This tutorial shows you how to set up the Chrome Remote Desktop service on a Debian Linux virtual machine (VM) instance on Compute Engine. Chrome Remote Desktop allows you to remotely access applications with a graphical user interface.\\nTaxi Data - Yellow Taxi Trip Records downloading error, Error no or XML error webpage\\nWhen you try to download the 2021 data from TLC website, you get this error:\\nIf you click on the link, and ERROR 403: Forbidden on the terminal.\\nWe have a backup, so use it instead: https://github.com/DataTalksClub/nyc-tlc-data\\nSo the link should be https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz\\nNote: Make sure to unzip the “gz” file (no, the “unzip” command won’t work for this.)\\n“gzip -d file.gz”g',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Set up Chrome Remote Desktop for Linux on Compute Engine'},\n",
       "  {'text': 'In this video, we store the data file as “output.csv”. The data file won’t store correctly if the file extension is csv.gz instead of csv. One alternative is to replace csv_name = “output.csv” with the file name given at the end of the URL. Notice that the URL for the yellow taxi data is: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz where the highlighted part is the name of the file. We can parse this file name from the URL and use it as csv_name. That is, we can replace csv_name = “output.csv” with\\ncsv_name = url.split(“/”)[-1] . Then, when we pass csv_name to pd.read_csv, there won’t be an issue even though the file name really has the extension csv.gz instead of csv since the pandas read_csv function can read csv.gz files directly.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Taxi Data - How to handle taxi data files, now that the files are available as *.csv.gz?'},\n",
       "  {'text': 'Yellow Trips: https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf\\nGreen Trips: https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Taxi Data - Data Dictionary for NY Taxi data?'},\n",
       "  {'text': 'You can unzip this downloaded parquet file, in the command line. The result is a csv file which can be imported with pandas using the pd.read_csv() shown in the videos.\\n‘’’gunzip green_tripdata_2019-09.csv.gz’’’\\nSOLUTION TO USING PARQUET FILES DIRECTLY IN PYTHON SCRIPT ingest_data.py\\nIn the def main(params) add this line\\nparquet_name= \\'output.parquet\\'\\nThen edit the code which downloads the files\\nos.system(f\"wget {url} -O {parquet_name}\")\\nConvert the download .parquet file to csv and rename as csv_name to keep it relevant to the rest of the code\\ndf = pd.read_parquet(parquet_name)\\ndf.to_csv(csv_name, index=False)',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Taxi Data - Unzip Parquet file'},\n",
       "  {'text': '“wget is not recognized as an internal or external command”, you need to install it.\\nOn Ubuntu, run:\\n$ sudo apt-get install wget\\nOn MacOS, the easiest way to install wget is to use Brew:\\n$ brew install wget\\nOn Windows, the easiest way to install wget is to use Chocolatey:\\n$ choco install wget\\nOr you can download a binary (https://gnuwin32.sourceforge.net/packages/wget.htm) and put it to any location in your PATH (e.g. C:/tools/)\\nAlso, you can following this step to install Wget on MS Windows\\n* Download the latest wget binary for windows from [eternallybored] (https://eternallybored.org/misc/wget/) (they are available as a zip with documentation, or just an exe)\\n* If you downloaded the zip, extract all (if windows built in zip utility gives an error, use [7-zip] (https://7-zip.org/)).\\n* Rename the file `wget64.exe` to `wget.exe` if necessary.\\n* Move wget.exe to your `Git\\\\mingw64\\\\bin\\\\`.\\nAlternatively, you can use a Python wget library, but instead of simply using “wget” you’ll need to use\\npython -m wget\\nYou need to install it with pip first:\\npip install wget\\nAlternatively, you can just paste the file URL into your web browser and download the file normally that way. You’ll want to move the resulting file into your working directory.\\nAlso recommended a look at the python library requests for the loading gz file  https://pypi.org/project/requests',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'wget is not recognized as an internal or external command'},\n",
       "  {'text': 'Firstly, make sure that you add “!” before wget if you’re running your command in a Jupyter Notebook or CLI. Then, you can check one of this 2 things (from CLI):\\nUsing the Python library wget you installed with pip, try python -m wget <url>\\nWrite the usual command and add --no-check-certificate at the end. So it should be:\\n!wget <website_url> --no-check-certificate',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'wget - ERROR: cannot verify <website> certificate  (MacOS)'},\n",
       "  {'text': 'For those who wish to use the backslash as an escape character in Git Bash for Windows (as Alexey normally does), type in the terminal: bash.escapeChar=\\\\ (no need to include in .bashrc)',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Git Bash - Backslash as an escape character in Git Bash for Windows'},\n",
       "  {'text': 'Instructions on how to store secrets that will be available in GitHub Codespaces.\\nManaging your account-specific secrets for GitHub Codespaces - GitHub Docs',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GitHub Codespaces - How to store secrets'},\n",
       "  {'text': \"Make sure you're able to start the Docker daemon, and check the issue immediately down below:\\nAnd don’t forget to update WSL. In PowerShell, the command is wsl --update\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - Cannot connect to Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?'},\n",
       "  {'text': \"As the official Docker for Windows documentation says, the Docker engine can either use the\\nHyper-V or WSL2 as its backend. However, a few constraints might apply\\nWindows 10 Pro / 11 Pro Users: \\nIn order to use Hyper-V as its back-end, you MUST have it enabled first, which you can do by following the tutorial: Enable Hyper-V Option on Windows 10 / 11\\nWindows 10 Home / 11 Home Users: \\nOn the other hand, Users of the 'Home' version do NOT have the option Hyper-V option enabled, which means, you can only get Docker up and running using the WSL2 credentials(Windows Subsystem for Linux). Url\\nYou can find the detailed instructions to do so here: rt ghttps://pureinfotech.com/install-wsl-windows-11/\\nIn case, you run into another issue while trying to install WSL2 (WslRegisterDistribution failed with error: 0x800701bc), Make sure you update the WSL2 Linux Kernel, following the guidelines here: \\n\\nhttps://github.com/microsoft/WSL/issues/5393\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - Error during connect: In the default daemon configuration on Windows, the docker client must be run with elevated privileges to connect.: Post: \"http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/containers/create\" : open //./pipe/docker_engine: The system cannot find the file specified'},\n",
       "  {'text': 'Whenever a `docker pull is performed (either manually or by `docker-compose up`), it attempts to fetch the given image name (pgadmin4, for the example above) from a repository (dbpage).\\nIF the repository is public, the fetch and download happens without any issue whatsoever.\\nFor instance:\\ndocker pull postgres:13\\ndocker pull dpage/pgadmin4\\nBE ADVISED:\\n\\nThe Docker Images we\\'ll be using throughout the Data Engineering Zoomcamp are all public (except when or if explicitly said otherwise by the instructors or co-instructors).\\n\\nMeaning: you are NOT required to perform a docker login to fetch them. \\n\\nSo if you get the message above saying \"docker login\\': denied: requested access to the resource is denied. That is most likely due to a typo in your image name:\\n\\nFor instance:\\n$ docker pull dbpage/pgadmin4\\nWill throw that exception telling you \"repository does not exist or may require \\'docker login\\'\\nError response from daemon: pull access denied for dbpage/pgadmin4, repository does not exist or \\nmay require \\'docker login\\': denied: requested access to the resource is denied\\nBut that actually happened because the actual image is dpage/pgadmin4 and NOT dbpage/pgadmin4\\nHow to fix it:\\n$ docker pull dpage/pgadmin4\\nEXTRA NOTES:\\nIn the real world, occasionally, when you\\'re working for a company or closed organisation, the Docker image you\\'re trying to fetch might be under a private repo that your DockerHub Username was granted access to.\\nFor which cases, you must first execute:\\n$ docker login\\nFill in the details of your username and password.\\nAnd only then perform the `docker pull` against that private repository\\nWhy am I encountering a \"permission denied\" error when creating a PostgreSQL Docker container for the New York Taxi Database with a mounted volume on macOS M1?\\nIssue Description:\\nWhen attempting to run a Docker command similar to the one below:\\ndocker run -it \\\\\\n-e 
POSTGRES_USER=\"root\" \\\\\\n-e POSTGRES_PASSWORD=\"root\" \\\\\\n-e POSTGRES_DB=\"ny_taxi\" \\\\\\n-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \\\\\\n-p 5432:5432 \\\\mount\\npostgres:13\\nYou encounter the error message:\\ndocker: Error response from daemon: error while creating mount source path \\'/path/to/ny_taxi_postgres_data\\': chown /path/to/ny_taxi_postgres_data: permission denied.\\nSolution:\\n1- Stop Rancher Desktop:\\nIf you are using Rancher Desktop and face this issue, stop Rancher Desktop to resolve compatibility problems.\\n2- Install Docker Desktop:\\nInstall Docker Desktop, ensuring that it is properly configured and has the required permissions.\\n2-Retry Docker Command:\\nRun the Docker command again after switching to Docker Desktop. This step resolves compatibility issues on some systems.\\nNote: The issue occurred because Rancher Desktop was in use. Switching to Docker Desktop resolves compatibility problems and allows for the successful creation of PostgreSQL containers with mounted volumes for the New York Taxi Database on macOS M1.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - docker pull dbpage'},\n",
       "  {'text': 'When I ran the command to create Postgres in a Docker container, it created a folder on my local machine to mount to a volume inside the container. The folder is read- and write-protected and owned by user 999, so I could not delete it by simply dragging it to the trash. Obsidian could not start due to an access error, so I had to move this folder and delete the old one with this command:\\nsudo rm -r -f docker_test/\\n- where `rm` - remove, `-r` - recursively, `-f` - force, `docker_test/` - folder.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - can’t delete local folder that mounted to docker volume'},\n",
       "  {'text': 'First off, make sure you\\'re running the latest version of Docker for Windows, which you can download from here. Sometimes using the menu to \"Upgrade\" doesn\\'t work (which is another clear indicator for you to uninstall, and reinstall with the latest version)\\nIf docker is stuck on starting, first try to switch containers by right clicking the docker symbol from the running programs and switch the containers from windows to linux or vice versa\\n[Windows 10 / 11 Pro Edition] The Pro Edition of Windows can run Docker either by using Hyper-V or WSL2 as its backend (Docker Engine)\\nIn order to use Hyper-V as its back-end, you MUST have it enabled first, which you can do by following the tutorial: Enable Hyper-V Option on Windows 10 / 11\\nIf you opt-in for WSL2, you can follow the same steps as detailed in the tutorial here',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': \"Docker - Docker won't start or is stuck in settings (Windows 10 / 11)\"},\n",
       "  {'text': \"It is recommended by the Docker do\\n[Windows 10 / 11 Home Edition] If you're running a Home Edition, you can still make it work with WSL2 (Windows Subsystem for Linux) by following the tutorial here\\nIf even after making sure your WSL2 (or Hyper-V) is set up accordingly, Docker remains stuck, you can try the option to Reset to Factory Defaults or do a fresh install.\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Should I run docker commands from the windows file system or a file system of a Linux distribution in WSL?'},\n",
       "  {'text': 'More info in the Docker Docs on Best Practises',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - cs to store all code in your default Linux distro to get the best out of file system performance (since Docker runs on WSL2 backend by default for Windows 10 Home / Windows 11 Home users).'},\n",
       "  {'text': 'You may have this error:\\n$ docker run -it ubuntu bash\\nthe input device is not a TTY. If you are using mintty, try prefixing the command with \\'winpty\\'\\nerror:\\nSolution:\\nUse winpty before docker command (source)\\n$ winpty docker run -it ubuntu bash\\nYou also can make an alias:\\necho \"alias docker=\\'winpty docker\\'\" >> ~/.bashrc\\nOR\\necho \"alias docker=\\'winpty docker\\'\" >> ~/.bash_profile',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - The input device is not a TTY (Docker run for Windows)'},\n",
       "  {'text': \"You may have this error:\\nRetrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.u\\nrllib3.connection.HTTPSConnection object at 0x7efe331cf790>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')':\\n/simple/pandas/\\nPossible solution might be:\\n$ winpty docker run -it --dns=8.8.8.8 --entrypoint=bash python:3.9\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - Cannot pip install on Docker container (Windows)'},\n",
       "  {'text': 'Even after properly running the docker script the folder is empty in the vs code  then try this (For Windows)\\nwinpty docker run -it \\\\\\n-e POSTGRES_USER=\"root\" \\\\\\n-e POSTGRES_PASSWORD=\"root\" \\\\\\n-e POSTGRES_DB=\"ny_taxi\" \\\\\\n-v \"C:\\\\Users\\\\abhin\\\\dataengg\\\\DE_Project_git_connected\\\\DE_OLD\\\\week1_set_up\\\\docker_sql/ny_taxi_postgres_data:/var/lib/postgresql/data\" \\\\\\n-p 5432:5432 \\\\\\npostgres:13\\nHere quoting the absolute path in  the -v parameter is solving the issue and all the files are visible in the Vs-code ny_taxi folder as shown in the video',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - ny_taxi_postgres_data is empty'},\n",
       "  {'text': 'Check this article for details - Setting up docker in macOS\\nFrom researching it seems this method might be out of date, it seems that since docker changed their licensing model, the above is a bit hit and miss. What worked for me was to just go to the docker website and download their dmg. Haven’t had an issue with that method.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - Setting up Docker on Mac'},\n",
       "  {'text': '$ docker run -it\\\\\\n-e POSTGRES_USER=\"root\" \\\\\\n-e POSTGRES_PASSWORD=\"admin\" \\\\\\n-e POSTGRES_DB=\"ny_taxi\" \\\\\\n-v \"/mnt/path/to/ny_taxi_postgres_data\":\"/var/lib/postgresql/data\" \\\\\\n-p 5432:5432 \\\\\\npostgres:13\\nCCW\\nThe files belonging to this database system will be owned by user \"postgres\".\\nThis use The database cluster will be initialized with locale \"en_US.utf8\".\\nThe default databerrorase encoding has accordingly been set to \"UTF8\".\\nxt search configuration will be set to \"english\".\\nData page checksums are disabled.\\nfixing permissions on existing directory /var/lib/postgresql/data ... initdb: f\\nerror: could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted  volume\\nOne way to solve this issue is to create a local docker volume and map it to postgres data directory /var/lib/postgresql/data\\nThe input dtc_postgres_volume_local must match in both commands below\\n$ docker volume create --name dtc_postgres_volume_local -d local\\n$ docker run -it\\\\\\n-e POSTGRES_USER=\"root\" \\\\\\n-e POSTGRES_PASSWORD=\"root\" \\\\\\n-e POSTGRES_DB=\"ny_taxi\" \\\\\\n-v dtc_postgres_volume_local:/var/lib/postgresql/data \\\\\\n-p 5432:5432\\\\\\npostgres:13\\nTo verify the above command works in (WSL2 Ubuntu 22.04, verified 2024-Jan), go to the Docker Desktop app and look under Volumes - dtc_postgres_volume_local would be listed there. The folder ny_taxi_postgres_data would however be empty, since we used an alternative config.\\nAn alternate error could be:\\ninitdb: error: directory \"/var/lib/postgresql/data\" exists but is not empty\\nIf you want to create a new database system, either remove or empthe directory \"/var/lib/postgresql/data\" or run initdb\\nwitls',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - Could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted'},\n",
       "  {'text': 'Mapping volumes on Windows could be tricky. The way it was done in the course video doesn’t work for everyone.\\nFirst, if yo\\nmove your data to some folder without spaces. E.g. if your code is in “C:/Users/Alexey Grigorev/git/…”, move it to “C:/git/…”\\nTry replacing the “-v” part with one of the following options:\\n-v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\\n-v //c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\\n-v /c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\\n-v //c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\\n--volume //driveletter/path/ny_taxi_postgres_data/:/var/lib/postgresql/data\\nwinpty docker run -it\\n-e POSTGRES_USER=\"root\"\\n-e POSTGRES_PASSWORD=\"root\"\\n-e POSTGRES_DB=\"ny_taxi\"\\n-v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\\n-p 5432:5432\\npostgres:1\\nTry adding winpty before the whole command\\n3\\nwin\\nTry adding quotes:\\n-v \"/c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\\n-v \"//c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\\n-v “/c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\\n-v \"//c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\\n-v \"c:\\\\some\\\\path\\\\ny_taxi_postgres_data\":/var/lib/postgresql/data\\nNote:  (Window) if it automatically creates a folder called “ny_taxi_postgres_data;C” suggests you have problems with volume mapping, try deleting both folders and replacing “-v” part with other options. For me “//c/” works instead of “/c/”. 
And it will work by automatically creating a correct folder called “ny_taxi_postgres_data”.\\nA possible solution to this error would be to use /”$(pwd)”/ny_taxi_postgres_data:/var/lib/postgresql/data (with quotes’ position varying as in the above list).\\nYes for windows use the command it works perfectly fine\\n-v /”$(pwd)”/ny_taxi_postgres_data:/var/lib/postgresql/data\\nImportant: note how the quotes are placed.\\nIf none of these options work, you can use a volume name instead of the path:\\n-v ny_taxi_postgres_data:/var/lib/postgresql/data\\nFor Mac: You can wrap $(pwd) with quotes like the highlighted.\\ndocker run -it \\\\\\n-e POSTGRES_USER=\"root\" \\\\\\n-e POSTGRES_PASSWORD=\"root\" \\\\\\n-e POSTGRES_DB=\"ny_taxi\" \\\\\\n-v \"$(pwd)\"/ny_taxi_postgres_data:/var/lib/postgresql/data \\\\\\n-p 5432:5432 \\\\\\nPostgres:13\\ndocker run -it \\\\\\n-e POSTGRES_USER=\"root\" \\\\\\n-e POSTGRES_PASSWORD=\"root\" \\\\\\n-e POSTGRES_DB=\"ny_taxi\" \\\\\\n-v \"$(pwd)\"/ny_taxi_postgres_data:/var/lib/postgresql/data \\\\\\n-p 5432:5432 \\\\\\npostgres:13\\nSource:https://stackoverflow.com/questions/48522615/docker-error-invalid-reference-format-repository-name-must-be-lowercase',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - invalid reference format: repository name must be lowercase (Mounting volumes with Docker on Windows)'},\n",
       "  {'text': 'Change the mounting path. Replace it with one of following:\\n-v /e/zoomcamp/...:/var/lib/postgresql/data\\n-v /c:/.../ny_taxi_postgres_data:/var/lib/postgresql/data\\\\ (leading slash in front of c:)',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - Error response from daemon: invalid mode: \\\\Program Files\\\\Git\\\\var\\\\lib\\\\postgresql\\\\data.'},\n",
       "  {'text': 'When you run this command second time\\ndocker run -it \\\\\\n-e POSTGRES_USER=\"root\" \\\\\\n-e POSTGRES_PASSWORD=\"root\" \\\\\\n-e POSTGRES_DB=\"ny_taxi\" \\\\\\n-v <your path>:/var/lib/postgresql/data \\\\\\n-p 5432:5432 \\\\\\npostgres:13\\nThe error message above could happen. That means you should not mount on the second run. This command helped me:\\nWhen you run this command second time\\ndocker run -it \\\\\\n-e POSTGRES_USER=\"root\" \\\\\\n-e POSTGRES_PASSWORD=\"root\" \\\\\\n-e POSTGRES_DB=\"ny_taxi\" \\\\\\n-p 5432:5432 \\\\\\npostgres:13',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': \"Docker - Error response from daemon: error while creating buildmount source path '/run/desktop/mnt/host/c/<your path>': mkdir /run/desktop/mnt/host/c: file exists\"},\n",
       "  {'text': 'This error appeared when running the command: docker build -t taxi_ingest:v001 .\\nWhen feeding the database with the data the user id of the directory ny_taxi_postgres_data was changed to 999, so my user couldn’t access it when running the above command. Even though this is not the problem here it helped to raise the error due to the permission issue.\\nSince at this point we only need the files Dockerfile and ingest_data.py, to fix this error one can run the docker build command on a different directory (having only these two files).\\nA more complete explanation can be found here: https://stackoverflow.com/questions/41286028/docker-build-error-checking-context-cant-stat-c-users-username-appdata\\nYou can fix the problem by changing the permission of the directory on ubuntu with following command:\\nsudo chown -R $USER dir_path\\nOn windows follow the link: https://thegeekpage.com/take-ownership-of-a-file-folder-through-command-prompt-in-windows-10/ \\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tAdded by\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tKenan Arslanbay',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': \"Docker - build error: error checking context: 'can't stat '/home/user/repos/data-engineering/week_1_basics_n_setup/2_docker_sql/ny_taxi_postgres_data''.\"},\n",
       "  {'text': 'You might have installed docker via snap. Run “sudo snap status docker” to verify.\\nIf you get “error: unknown command \"status\", see \\'snap help\\'.” as a response, then uninstall docker and install it via the official website\\nBind for 0.0.0.0:5432 failed: port is a',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - ERRO[0000] error waiting for container: context canceled'},\n",
       "  {'text': 'Found the issue in the PopOS linux. It happened because our user didn’t have authorization rights to the host folder ( which also caused folder seems empty, but it didn’t!).\\n✅Solution:\\nJust add permission for everyone to the corresponding folder\\nsudo chmod -R 777 <path_to_folder>\\nExample:\\nsudo chmod -R 777 ny_taxi_postgres_data/',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - build error checking context: can’t stat ‘/home/fhrzn/Projects/…./ny_taxi_postgres_data’'},\n",
       "  {'text': 'This happens on Ubuntu/Linux systems when trying to run the command to build the Docker container again.\\n$ docker build -t taxi_ingest:v001 .\\nA folder is created to host the Docker files. When the build command is executed again to rebuild the pipeline or create a new one, the error is raised as there are no permissions on this new folder. Grant permissions by running this command:\\n$ sudo chmod -R 755 ny_taxi_postgres_data\\nOr use 777 if you still see problems. 755 grants write access to only the owner.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - failed to solve with frontend dockerfile.v0: failed to read dockerfile: error from sender: open ny_taxi_postgres_data: permission denied.'},\n",
       "  {'text': 'Get the network name via: $ docker network ls.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - Docker network name'},\n",
       "  {'text': 'Sometimes, when you try to restart a docker image configured with a network name, the above message appears. In this case, use the following command with the appropriate container name:\\n>>> If the container is running state, use docker stop <container_name>\\n>>> then, docker rm pg-database\\nOr use docker start instead of docker run in order to restart the docker image without removing it.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - Error response from daemon: Conflict. The container name \"pg-database\" is already in use by container “xxx”.  You have to remove (or rename) that container to be able to reuse that name.'},\n",
       "  {'text': 'Typical error: sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name \"pgdatabase\" to address: Name or service not known\\nWhen running docker-compose up -d see which network is created and use this for the ingestions script instead of pg-network and see the name of the database to use instead of pgdatabase\\nE.g.:\\npg-network becomes 2docker_default\\nPgdatabase becomes 2docker-pgdatabase-1',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - ingestion when using docker-compose could not translate host name'},\n",
       "  {'text': 'Run this command before starting your VM:\\nOn Intel CPU:\\nmodprobe -r kvm_intel\\nmodprobe kvm_intel nested=1\\nOn AMD CPU:\\nmodprobe -r kvm_amd\\nmodprobe kvm_amd nested=1',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - Cannot install docker on MacOS/Windows 11 VM running on top of Linux (due to Nested virtualization).'},\n",
       "  {'text': 'It’s very easy to manage your docker container, images, network and compose projects from VS Code.\\nJust install the official extension and launch it from the left side icon.\\nIt will work even if your Docker runs on WSL2, as VS Code can easily connect with your Linux.\\nDocker - How to stop a container?\\nUse the following command:\\n$ docker stop <container_id>',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - Connecting from VS Code'},\n",
       "  {'text': \"When you see this in logs, your container with postgres is not accepting any requests, so if you attempt to connect, you'll get this error:\\nconnection failed: server closed the connection unexpectedly\\nThis probably means the server terminated abnormally before or while processing the request.\\nIn this case, you need to delete the directory with data (the one you map to the container with the -v flag) and restart the container.\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - PostgreSQL Database directory appears to contain a database. Database system is shut down'},\n",
       "  {'text': 'On few versions of Ubuntu, snap command can be used to install Docker.\\nsudo snap install docker',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker not installable on Ubuntu'},\n",
       "  {'text': 'error: could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted  volume\\nif you have used the prev answer (just before this) and have created a local docker volume, then you need to tell the compose file about the named volume:\\nvolumes:\\ndtc_postgres_volume_local:  # Define the named volume here\\n# services mentioned in the compose file auto become part of the same network!\\nservices:\\nyour remaining code here . . .\\nnow use docker volume inspect dtc_postgres_volume_local to see the location by checking the value of Mountpoint\\nIn my case, after i ran docker compose up the mounting dir created was named ‘docker_sql_dtc_postgres_volume_local’ whereas it should have used the already existing ‘dtc_postgres_volume_local’\\nAll i did to fix this is that I renamed the existing ‘dtc_postgres_volume_local’ to ‘docker_sql_dtc_postgres_volume_local’ and removed the newly created one (just be careful when doing this)\\nrun docker compose up again and check if the table is there or not!',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose - mounting error'},\n",
       "  {'text': 'Couldn’t translate host name to address\\nMake sure postgres database is running.\\n\\n\\u200b\\u200bUse the command to start containers in detached mode: docker-compose up -d\\n(data-engineering-zoomcamp) hw % docker compose up -d\\n[+] Running 2/2\\n⠿ Container pg-admin     Started                                                                                                                                                                      0.6s\\n⠿ Container pg-database  Started\\nTo view the containers use: docker ps.\\n(data-engineering-zoomcamp) hw % docker ps\\nCONTAINER ID   IMAGE            COMMAND                  CREATED          STATUS          PORTS                           NAMES\\nfaf05090972e   postgres:13      \"docker-entrypoint.s…\"   39 seconds ago   Up 37 seconds   0.0.0.0:5432->5432/tcp          pg-database\\n6344dcecd58f   dpage/pgadmin4   \"/entrypoint.sh\"         39 seconds ago   Up 37 seconds   443/tcp, 0.0.0.0:8080->80/tcp   pg-admin\\nhw\\nTo view logs for a container: docker logs <containerid>\\n(data-engineering-zoomcamp) hw % docker logs faf05090972e\\nPostgreSQL Database directory appears to contain a database; Skipping initialization\\n2022-01-25 05:58:45.948 UTC [1] LOG:  starting PostgreSQL 13.5 (Debian 13.5-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit\\n2022-01-25 05:58:45.948 UTC [1] LOG:  listening on IPv4 address \"0.0.0.0\", port 5432\\n2022-01-25 05:58:45.948 UTC [1] LOG:  listening on IPv6 address \"::\", port 5432\\n2022-01-25 05:58:45.954 UTC [1] LOG:  listening on Unix socket \"/var/run/postgresql/.s.PGSQL.5432\"\\n2022-01-25 05:58:45.984 UTC [28] LOG:  database system was interrupted; last known up at 2022-01-24 17:48:35 UTC\\n2022-01-25 05:58:48.581 UTC [28] LOG:  database system was not properly shut down; automatic recovery in\\nprogress\\n2022-01-25 05:58:48.602 UTC [28] LOG:  redo starts at 0/872A5910\\n2022-01-25 05:59:33.726 UTC [28] 
LOG:  invalid record length at 0/98A3C160: wanted 24, got 0\\n2022-01-25 05:59:33.726 UTC [28\\n] LOG:  redo done at 0/98A3C128\\n2022-01-25 05:59:48.051 UTC [1] LOG:  database system is ready to accept connections\\nIf docker ps doesn’t show pgdatabase running, run: docker ps -a\\nThis should show all containers, either running or stopped.\\nGet the container id for pgdatabase-1, and run',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose - Error translating host name to address'},\n",
       "  {'text': 'After executing `docker-compose up` - if you lose database data and are unable to successfully execute your Ingestion script (to re-populate your database) but receive the following error:\\nsqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name /data_pgadmin:/var/lib/pgadmin\"pg-database\" to address: Name or service not known\\nDocker compose is creating its own default network since it is no longer specified in a docker execution command or file. Docker Compose will emit to logs the new network name. See the logs after executing `docker compose up` to find the network name and change the network name argument in your Ingestion script.\\nIf problems persist with pgcli, we can use HeidiSQL,usql\\nKrishna Anand',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose -  Data retention (could not translate host name \"pg-database\" to address: Name or service not known)'},\n",
       "  {'text': 'It returns --> Error response from daemon: network 66ae65944d643fdebbc89bd0329f1409dec2c9e12248052f5f4c4be7d1bdc6a3 not found\\nTry:\\ndocker ps -a to see all the stopped & running containers\\nd to nuke all the containers\\nTry: docker-compose up -d again ports\\nOn localhost:8080 server → Unable to connect to server: could not translate host name \\'pg-database\\' to address: Name does not resolve\\nTry: new host name, best without “ - ” e.g. pgdatabase\\nAnd on docker-compose.yml, should specify docker network & specify the same network in both  containers\\nservices:\\npgdatabase:\\nimage: postgres:13\\nenvironment:\\n- POSTGRES_USER=root\\n- POSTGRES_PASSWORD=root\\n- POSTGRES_DB=ny_taxi\\nvolumes:\\n- \"./ny_taxi_postgres_data:/var/lib/postgresql/data:rw\"\\nports:\\n- \"5431:5432\"\\nnetworks:\\n- pg-network\\npgadmin:\\nimage: dpage/pgadmin4\\nenvironment:\\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\\n- PGADMIN_DEFAULT_PASSWORD=root\\nports:\\n- \"8080:80\"\\nnetworks:\\n- pg-network\\nnetworks:\\npg-network:\\nname: pg-network',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose - Hostname does not resolve'},\n",
       "  {'text': 'So one common issue is when you run docker-compose on GCP, postgres won’t persist it’s data to mentioned path for example:\\nservices:\\n…\\n…\\npgadmin:\\n…\\n…\\nVolumes:\\n“./pgadmin”:/var/lib/pgadmin:wr”\\nMight not work so in this use you can use Docker Volume to make it persist, by simply changing\\nservices:\\n…\\n….\\npgadmin:\\n…\\n…\\nVolumes:\\npgadmin:/var/lib/pgadmin\\nvolumes:\\nPgadmin:',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose - Persist PGAdmin docker contents on GCP'},\n",
       "  {'text': 'The docker will keep on crashing continuously\\nNot working after restart\\ndocker engine stopped\\nAnd failed to fetch extensions pop ups will on screen non-stop\\nSolution :\\nTry checking if latest version of docker is installed / Try updating the docker\\nIf Problem still persist then final solution is to reinstall docker\\n(Just have to fetch images again else no issues)',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker engine stopped_failed to fetch extensions'},\n",
       "  {'text': 'As per the lessons,\\nPersisting pgAdmin configuration (i.e. server name) is done by adding a “volumes” section:\\nservices:\\npgdatabase:\\n[...]\\npgadmin:\\nimage: dpage/pgadmin4\\nenvironment:\\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\\n- PGADMIN_DEFAULT_PASSWORD=root\\nvolumes:\\n- \"./pgAdmin_data:/var/lib/pgadmin/sessions:rw\"\\nports:\\n- \"8080:80\"\\nIn the example above, ”pgAdmin_data” is a folder on the host machine, and “/var/lib/pgadmin/sessions” is the session settings folder in the pgAdmin container.\\nBefore running docker-compose up on the YAML file, we also need to give the pgAdmin container access to write to the “pgAdmin_data” folder. The container runs with a username called “5050” and user group “5050”. The bash command to give access over the mounted volume is:\\nsudo chown -R 5050:5050 pgAdmin_data',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose - Persist PGAdmin configuration'},\n",
       "  {'text': 'This happens if you did not create the docker group and added your user. Follow these steps from the link:\\nguides/docker-without-sudo.md at main · sindresorhus/guides · GitHub\\nAnd then press ctrl+D to log-out and log-in again. pgAdmin: Maintain state so that it remembers your previous connection\\nIf you are tired of having to setup your database connection each time that you fire up the containers, all you have to do is create a volume for pgAdmin:\\nIn your docker-compose.yaml file, enter the following into your pgAdmin declaration:\\nvolumes:\\n- type: volume\\nsource: pgadmin_data\\ntarget: /var/lib/pgadmin\\nAlso add the following to the end of the file:ls\\nvolumes:\\nPgadmin_data:',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose - dial unix /var/run/docker.sock: connect: permission denied'},\n",
       "  {'text': 'This is happen to me after following 1.4.1 video where we are installing docker compose in our Google Cloud VM. In my case, the docker-compose file downloaded from github named docker-compose-linux-x86_64 while it is more convenient to use docker-compose command instead. So just change the docker-compose-linux-x86_64 into docker-compose.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose - docker-compose still not available after changing .bashrc'},\n",
       "  {'text': 'Installing pass via ‘sudo apt install pass’ helped to solve the issue. More about this can be found here: https://github.com/moby/buildkit/issues/1078',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose - Error getting credentials after running docker-compose up -d'},\n",
       "  {'text': \"For everyone who's having problem with Docker compose, getting the data in postgres and similar issues, please take care of the following:\\ncreate a new volume on docker (either using the command line or docker desktop app)\\nmake the following changes to your docker-compose.yml file (see attachment)\\nset low_memory=false when importing the csv file (df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=1000, low_memory=False))\\nuse the below function (in the upload-data.ipynb) for better tracking of your ingestion process (see attachment)\\nOrder of execution:\\n(1) open terminal in 2_docker_sql folder and run docker compose up\\n(2) ensure no other containers are running except the one you just executed (pgadmin and pgdatabase)\\n(3) open jupyter notebook and begin the data ingestion\\n(4) open pgadmin and set up a server (make sure you use the same configurations as your docker-compose.yml file like the same name (pgdatabase), port, databasename (ny_taxi) etc.\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose - Errors pertaining to docker-compose.yml and pgadmin setup'},\n",
       "  {'text': 'Locate config.json file for docker (check your home directory; Users/username/.docker).\\nModify credsStore to credStore\\nSave and re-run',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker Compose up -d error getting credentials - err: exec: \"docker-credential-desktop\": executable file not found in %PATH%, out: ``'},\n",
       "  {'text': 'To figure out which docker-compose you need to download from https://github.com/docker/compose/releases you can check your system with these commands:\\nuname -s  -> return Linux most likely\\nuname -m -> return \"flavor\"\\nOr try this command -\\nsudo curl -L \"https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)\" -o /usr/local/bin/docker-compose',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose - Which docker-compose binary to use for WSL?'},\n",
       "  {'text': 'If you wrote the docker-compose.yaml file exactly like the video, you might run into an error like this:dev\\nservice \"pgdatabase\" refers to undefined volume dtc_postgres_volume_local: invalid compose project\\nIn order to make it work, you need to include the volume in your docker-compose file. Just add the following:\\nvolumes:\\ndtc_postgres_volume_local:\\n(Make sure volumes are at the same level as services.)',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker-Compose - Error undefined volume in Windows/WSL'},\n",
       "  {'text': 'Error:  initdb: error: could not change permissions of directory\\nIssue: WSL and Windows do not manage permissions in the same way causing conflict if using the Windows file system rather than the WSL file system.\\nSolution: Use Docker volumes.\\nWhy: Volume is used for storage of persistent data and not for use of transferring files. A local volume is unnecessary.\\nBenefit: This resolves permission issues and allows for better management of volumes.\\nNOTE: the ‘user:’ is not necessary if using docker volumes, but is if using local drive.\\n</>  docker-compose.yaml\\nservices:\\npostgres:\\nimage: postgres:15-alpine\\ncontainer_name: postgres\\nuser: \"0:0\"\\nenvironment:\\n- POSTGRES_USER=postgres\\n- POSTGRES_PASSWORD=postgres\\n- POSTGRES_DB=ny_taxi\\nvolumes:\\n- \"pg-data:/var/lib/postgresql/data\"\\nports:\\n- \"5432:5432\"\\nnetworks:\\n- pg-network\\npgadmin:\\nimage: dpage/pgadmin4\\ncontainer_name: pgadmin\\nuser: \"${UID}:${GID}\"\\nenvironment:\\n- PGADMIN_DEFAULT_EMAIL=email@some-site.com\\n- PGADMIN_DEFAULT_PASSWORD=pgadmin\\nvolumes:\\n- \"pg-admin:/var/lib/pgadmin\"\\nports:\\n- \"8080:80\"\\nnetworks:\\n- pg-network\\nnetworks:\\npg-network:\\nname: pg-network\\nvolumes:\\npg-data:\\nname: ingest_pgdata\\npg-admin:\\nname: ingest_pgadmin',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'WSL Docker directory permissions error'},\n",
       "  {'text': 'Cause : If Running on git bash or vm in windows pgadmin doesnt work easily LIbraries like psycopg2 and libpq ar required still the error persists.\\nSolution- I use psql instead of pgadmin totally same\\nPip install psycopg2\\ndock',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Docker - If pgadmin is not working for Querying in Postgres Use PSQL'},\n",
       "  {'text': 'Cause:\\nIt happens because the apps are not updated. To be specific, search for any pending updates for Windows Terminal, WSL and Windows Security updates.\\nSolution\\nfor updating Windows terminal which worked for me:\\nGo to Microsoft Store.\\nGo to the library of apps installed in your system.\\nSearch for Windows terminal.\\nUpdate the app and restart your system to  see the changes.\\nFor updating the Windows security updates:\\nGo to Windows updates and check if there are any pending updates from Windows, especially security updates.\\nDo restart your system once the updates are downloaded and installed successfully.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'WSL - Insufficient system resources exist to complete the requested service.'},\n",
       "  {'text': 'Up restardoting the same issue appears. Happens out of the blue on windows.\\nSolution 1: Fixing DNS Issue (credit: reddit) this worked for me personally\\nreg add \"HKLM\\\\System\\\\CurrentControlSet\\\\Services\\\\Dnscache\" /v \"Start\" /t REG_DWORD /d \"4\" /f\\nRestart your computer and then enable it with the following\\nreg add \"HKLM\\\\System\\\\CurrentControlSet\\\\Services\\\\Dnscache\" /v \"Start\" /t REG_DWORD /d \"2\" /f\\nRestart your OS again. It should work.\\nSolution 2: right click on running Docker icon (next to clock) and chose \"Switch to Linux containers\"\\nbash: conda: command not found\\nDatabase is uninitialized and superuser password is not specified.\\nDatabase is uninitialized and superuser password is not specified.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'WSL - WSL integration with distro Ubuntu unexpectedly stopped with exit code 1.'},\n",
       "  {'text': 'Issue when trying to run the GPC VM through SSH through WSL2,  probably because WSL2 isn’t looking for .ssh keys in the correct folder. My case I was trying to run this command in the terminal and getting an error\\nPC:/mnt/c/Users/User/.ssh$ ssh -i gpc [username]@[my external IP]\\nYou can try to use sudo before the command\\nSudo .ssh$ ssh -i gpc [username]@[my external IP]\\nYou can also try to cd to your folder and change the permissions for the private key SSH file.\\nchmod 600 gpc\\nIf that doesn’t work, create a .ssh folder in the home diretory of WSL2 and copy the content of windows .ssh folder to that new folder.\\ncd ~\\nmkdir .ssh\\ncp -r /mnt/c/Users/YourUsername/.ssh/* ~/.ssh/\\nYou might need to adjust the permissions of the files and folders in the .ssh directory.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'WSL - Permissions too open at Windows'},\n",
       "  {'text': 'Such as the issue above, WSL2 may not be referencing the correct .ssh/config path from Windows. You can create a config file at the home directory of WSL2.\\ncd ~\\nmkdir .ssh\\nCreate a config file in this new .ssh/ folder referencing this folder:\\nHostName [GPC VM external IP]\\nUser [username]\\nIdentityFile ~/.ssh/[private key]',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'WSL - Could not resolve host name'},\n",
       "  {'text': 'Change TO Socket\\npgcli -h 127.0.0.1 -p 5432 -u root -d ny_taxi\\npgcli -h 127.0.0.1 -p 5432 -u root -d ny_taxi',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'PGCLI - connection failed: :1), port 5432 failed: could not receive data from server: Connection refused could not send SSL negotiation packet: Connection refused'},\n",
       "  {'text': 'probably some installation error, check out sy',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'PGCLI --help error'},\n",
       "  {'text': 'In this section of the course, the 5432 port of pgsql is mapped to your computer’s 5432 port. Which means you can access the postgres database via pgcli directly from your computer.\\nSo No, you don’t need to run it inside another container. Your local system will do.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'PGCLI - INKhould we run pgcli inside another docker container?'},\n",
       "  {'text': 'FATAL:  password authentication failed for user \"root\"\\nobservations: Below in bold do not forget the folder that was created ny_taxi_postgres_data\\nThis happens if you have a local Postgres installation in your computer. To mitigate this, use a different port, like 5431, when creating the docker container, as in: -p 5431: 5432\\nThen, we need to use this port when connecting to pgcli, as shown below:\\npgcli -h localhost -p 5431 -u root -d ny_taxi\\nThis will connect you to your postgres docker container, which is mapped to your host’s 5431 port (though you might choose any port of your liking as long as it is not occupied).\\nFor a more visual and detailed explanation, feel free to check the video 1.4.2 - Port Mapping and Networks in Docker\\nIf you want to debug: the following can help (on a MacOS)\\nTo find out if something is blocking your port (on a MacOS):\\nYou can use the lsof command to find out which application is using a specific port on your local machine. `lsof -i :5432`wi\\nOr list the running postgres services on your local machine with launchctl\\nTo unload the running service on your local machine (on a MacOS):\\nunload the launch agent for the PostgreSQL service, which will stop the service and free up the port  \\n`launchctl unload -w ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist`\\nthis one to start it again\\n`launchctl load -w ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist`\\nChanging port from 5432:5432 to 5431:5432 helped me to avoid this error.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'PGCLI - FATAL: password authentication failed for user \"root\" (You already have Postgres)'},\n",
       "  {'text': 'I get this error\\npgcli -h localhost -p 5432 -U root -d ny_taxi\\nTraceback (most recent call last):\\nFile \"/opt/anaconda3/bin/pgcli\", line 8, in <module>\\nsys.exit(cli())\\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 1128, in __call__\\nreturn self.main(*args, **kwargs)\\nFile \"/opt/anaconda3/lib/python3.9/sitYe-packages/click/core.py\", line\\n1053, in main\\nrv = self.invoke(ctx)\\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 1395, in invoke\\nreturn ctx.invoke(self.callback, **ctx.params)\\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 754, in invoke\\nreturn __callback(*args, **kwargs)\\nFile \"/opt/anaconda3/lib/python3.9/site-packages/pgcli/main.py\", line 880, in cli\\nos.makedirs(config_dir)\\nFile \"/opt/anaconda3/lib/python3.9/os.py\", line 225, in makedirspython\\nmkdir(name, mode)PermissionError: [Errno 13] Permission denied: \\'/Users/vray/.config/pgcli\\'\\nMake sure you install pgcli without sudo.\\nThe recommended approach is to use conda/anaconda to make sure your system python is not affected.\\nIf conda install gets stuck at \"Solving environment\" try these alternatives: https://stackoverflow.com/questions/63734508/stuck-at-solving-environment-on-anaconda',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': \"PGCLI - PermissionError: [Errno 13] Permission denied: '/some/path/.config/pgcli'\"},\n",
       "  {'text': 'ImportError: no pq wrapper available.\\nAttempts made:\\n- couldn\\'t import \\\\dt\\nopg \\'c\\' implementation: No module named \\'psycopg_c\\'\\n- couldn\\'t import psycopg \\'binary\\' implementation: No module named \\'psycopg_binary\\'\\n- couldn\\'t import psycopg \\'python\\' implementation: libpq library not found\\nSolution:\\nFirst, make sure your Python is set to 3.9, at least.\\nAnd the reason for that is we have had cases of \\'psycopg2-binary\\' failing to install because of an old version of Python (3.7.3). \\n\\n0. You can check your current python version with: \\n$ python -V(the V must be capital)\\n1. Based on the previous output, if you\\'ve got a 3.9, skip to Step #2\\n   Otherwispye better off with a new environment with 3.9\\n$ conda create –name de-zoomcamp python=3.9\\n$ conda activate de-zoomcamp\\n2. Next, you should be able to install the lib for postgres like this:\\n```\\n$ e\\n$ pip install psycopg2_binary\\n```\\n3. Finally, make sure you\\'re also installing pgcli, but use conda for that:\\n```\\n$ pgcli -h localhost -U root -d ny_taxisudo\\n```\\nThere, you should be good to go now!\\nAnother solution:\\nRun this\\npip install \"psycopg[binary,pool]\"',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'PGCLI - no pq wrapper available.'},\n",
       "  {'text': 'If your Bash prompt is stuck on the password command for postgres\\nUse winpty:\\nwinpty pgcli -h localhost -p 5432 -u root -d ny_taxi\\nAlternatively, try using Windows terminal or terminal in VS code.\\nEditPGCLI -connection failed: FATAL:  password authentication failed for user \"root\"\\nThe error above was faced continually despite inputting the correct password\\nSolution\\nOption 1: Stop the PostgreSQL service on Windows\\nOption 2 (using WSL): Completely uninstall Protgres 12 from Windows and install postgresql-client on WSL (sudo apt install postgresql-client-common postgresql-client libpq-dev)\\nOption 3: Change the port of the docker container\\nNEW SOLUTION: 27/01/2024\\nPGCLI -connection failed: FATAL:  password authentication failed for user \"root\"\\nIf you’ve got the error above, it’s probably because you were just like me, closed the connection to the Postgres:13 image in the previous step of the tutorial, which is\\n\\ndocker run -it \\\\\\n-e POSTGRES_USER=root \\\\\\n-e POSTGRES_PASSWORD=root \\\\\\n-e POSTGRES_DB=ny_taxi \\\\\\n-v d:/git/data-engineering-zoomcamp/week_1/docker_sql/ny_taxi_postgres_data:/var/lib/postgresql/data \\\\\\n-p 5432:5432 \\\\\\npostgres:13\\nSo keep the database connected and you will be able to implement all the next steps of the tutorial.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'PGCLI -  stuck on password prompt'},\n",
       "  {'text': 'Problem: If you have already installed pgcli but bash doesn\\'t recognize pgcli\\nOn Git bash: bash: pgcli: command not found\\nOn Windows Terminal: pgcli: The term \\'pgcli\\' is not recognized…\\nSolution: Try adding a Python path C:\\\\Users\\\\...\\\\AppData\\\\Roaming\\\\Python\\\\Python39\\\\Scripts to Windows PATH\\nFor details:\\nGet the location: pip list -v\\nCopy C:\\\\Users\\\\...\\\\AppData\\\\Roaming\\\\Python\\\\Python39\\\\site-packages\\n3. Replace site-packages with Scripts: C:\\\\Users\\\\...\\\\AppData\\\\Roaming\\\\Python\\\\Python39\\\\Scripts\\nIt can also be that you have Python installed elsewhere.\\nFor me it was under c:\\\\python310\\\\lib\\\\site-packages\\nSo I had to add c:\\\\python310\\\\lib\\\\Scripts to PATH, as shown below.\\nPut the above path in \"Path\" (or \"PATH\") in System Variables\\nReference: https://stackoverflow.com/a/68233660',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'PGCLI - pgcli: command not found'},\n",
       "  {'text': 'In case running pgcli  locally causes issues or you do not want to install it locally you can use it running in a Docker container instead.\\nBelow the usage with values used in the videos of the course for:\\nnetwork name (docker network)\\npostgres related variables for pgcli\\nHostname\\nUsername\\nPort\\nDatabase name\\n$ docker run -it --rm --network pg-network ai2ys/dockerized-pgcli:4.0.1\\n175dd47cda07:/# pgcli -h pg-database -U root -p 5432 -d ny_taxi\\nPassword for root:\\nServer: PostgreSQL 16.1 (Debian 16.1-1.pgdg120+1)\\nVersion: 4.0.1\\nHome: http://pgcli.com\\nroot@pg-database:ny_taxi> \\\\dt\\n+--------+------------------+-------+-------+\\n| Schema | Name             | Type  | Owner |\\n|--------+------------------+-------+-------|\\n| public | yellow_taxi_data | table | root  |\\n+--------+------------------+-------+-------+\\nSELECT 1\\nTime: 0.009s\\nroot@pg-database:ny_taxi>',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'PGCLI - running in a Docker container'},\n",
       "  {'text': 'PULocationID will not be recognized but “PULocationID” will be. This is because unquoted \"Localidentifiers are case insensitive. See docs.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'PGCLI - case sensitive use “Quotations” around columns with capital letters'},\n",
       "  {'text': 'When using the command `\\\\d <database name>` you get the error column `c.relhasoids does not exist`.\\nResolution:\\nUninstall pgcli\\nReinstall pgclidatabase \"ny_taxi\" does not exist\\nRestart pc',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'PGCLI - error column c.relhasoids does not exist'},\n",
       "  {'text': \"This happens while uploading data via the connection in jupyter notebook\\nengine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')\\nThe port 5432 was taken by another postgres. We are not connecting to the port in docker, but to the port on our machine. Substitute 5431 or whatever port you mapped to for port 5432.\\nAlso if this error is still persistent , kindly check if you have a service in windows running postgres , Stopping that service will resolve the issue\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL:  password authentication failed for user \"root\"'},\n",
       "  {'text': 'Can happen when connecting via pgcli\\npgcli -h localhost -p 5432 -U root -d ny_taxi\\nOr while uploading data via the connection in jupyter notebook\\nengine = create_engine(\\'postgresql://root:root@localhost:5432/ny_taxi\\')\\nThis can happen when Postgres is already installed on your computer. Changing the port can resolve that (e.g. from 5432 to 5431).\\nTo check whether there even is a root user with the ability to login:\\nTry: docker exec -it <your_container_name> /bin/bash\\nAnd then run\\n???\\nAlso, you could change port from 5432:5432 to 5431:5432\\nOther solution that worked:\\nChanging `POSTGRES_USER=juroot` to `PGUSER=postgres`\\nBased on this: postgres with docker compose gives FATAL: role \"root\" does not exist error - Stack Overflow\\nAlso `docker compose down`, removing folder that had postgres volume, running `docker compose up` again.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL:  role \"root\" does not exist'},\n",
       "  {'text': '~\\\\anaconda3\\\\lib\\\\site-packages\\\\psycopg2\\\\__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)\\n120\\n121     dsn = _ext.make_dsn(dsn, **kwargs)\\n--> 122     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)\\n123     if cursor_factory is not None:\\n124         conn.cursor_factory = cursor_factory\\nOperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL:  database \"ny_taxi\" does not exist\\nMake sure postgres is running. You can check that by running `docker ps`\\n✅Solution: If you have postgres software installed on your computer before now, build your instance on a different port like 8080 instead of 5432',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL:  dodatabase \"ny_taxi\" does not exist'},\n",
       "  {'text': \"Issue:\\ne…\\nSolution:\\npip install psycopg2-binary\\nIf you already have it, you might need to update it:\\npip install psycopg2-binary --upgrade\\nOther methods, if the above fails:\\nif you are getting the “ ModuleNotFoundError: No module named 'psycopg2' “ error even after the above installation, then try updating conda using the command conda update -n base -c defaults conda. Or if you are using pip, then try updating it before installing the psycopg packages i.e\\nFirst uninstall the psycopg package\\nThen update conda or pip\\nThen install psycopg again using pip.\\nif you are still facing error with r pcycopg2 and showing pg_config not found then you will have to install postgresql. in MAC it is brew install postgresql\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': \"Postgres - ModuleNotFoundError: No module named 'psycopg2'\"},\n",
       "  {'text': 'In the join queries, if we mention the column name directly or enclosed in single quotes it’ll throw an error says “column does not exist”.\\n✅Solution: But if we enclose the column names in double quotes then it will work',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Postgres - \"Column does not exist\" but it actually does (Pyscopg2 error in MacBook Pro M2)'},\n",
       "  {'text': 'pgAdmin has a new version. Create server dialog may not appear. Try using register-> server instead.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'pgAdmin - Create server dialog does not appear'},\n",
       "  {'text': 'Using GitHub Codespaces in the browser resulted in a blank screen after the login to pgAdmin (running in a Docker container). The terminal of the pgAdmin container was showing the following error message:\\nCSRFError: 400 Bad Request: The referrer does not match the host.\\nSolution #1:\\nAs recommended in the following issue  https://github.com/pgadmin-org/pgadmin4/issues/5432 setting the following environment variable solved it.\\nPGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\"\\nModified “docker run” command\\ndocker run --rm -it \\\\\\n-e PGADMIN_DEFAULT_EMAIL=\"admin@admin.com\" \\\\\\n-e PGADMIN_DEFAULT_PASSWORD=\"root\" \\\\\\n-e PGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\" \\\\\\n-p \"8080:80\" \\\\\\n--name pgadmin \\\\\\n--network=pg-network \\\\\\ndpage/pgadmin4:8.2\\nSolution #2:\\nUsing the local installed VSCode to display GitHub Codespaces.\\nWhen using GitHub Codespaces in the locally installed VSCode (opening a Codespace or creating/starting one) this issue did not occur.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'pgAdmin - Blank/white screen after login (browser)'},\n",
       "  {'text': 'I am using a Mac Pro device and connect to the GCP Compute Engine via Remote SSH - VSCode. But when I trying to run the PgAdmin container via docker run or docker compose command, I am failed to access the pgAdmin address via my browser. I have switched to another browser, but still can not access the pgAdmin address. So I modified a little bit the configuration from the previous DE Zoomcamp repository like below and can access the pgAdmin address:\\nSolution #1:\\nModified “docker run” command\\ndocker run --rm -it \\\\\\n-e PGADMIN_DEFAULT_EMAIL=\"admin@admin.com\" \\\\\\n-e PGADMIN_DEFAULT_PASSWORD=\"pgadmin\" \\\\\\n-e PGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\" \\\\\\n-e PGADMIN_LISTEN_ADDRESS=0.0.0.0 \\\\\\n-e PGADMIN_LISTEN_PORT=5050 \\\\\\n-p 5050:5050 \\\\\\n--network=de-zoomcamp-network \\\\\\n--name pgadmin-container \\\\\\n--link postgres-container \\\\\\n-t dpage/pgadmin4\\nSolution #2:\\nModified docker-compose.yaml configuration (via “docker compose up” command)\\npgadmin:\\nimage: dpage/pgadmin4\\ncontainer_name: pgadmin-conntainer\\nenvironment:\\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\\n- PGADMIN_DEFAULT_PASSWORD=pgadmin\\n- PGADMIN_CONFIG_WTF_CSRF_ENABLED=False\\n- PGADMIN_LISTEN_ADDRESS=0.0.0.0\\n- PGADMIN_LISTEN_PORT=5050\\nvolumes:\\n- \"./pgadmin_data:/var/lib/pgadmin/data\"\\nports:\\n- \"5050:5050\"\\nnetworks:\\n- de-zoomcamp-network\\ndepends_on:\\n- postgres-conntainer\\nPython - ModuleNotFoundError: No module named \\'pysqlite2\\'\\nImportError: DLL load failed while importing _sqlite3: The specified module could not be found. ModuleNotFoundError: No module named \\'pysqlite2\\'\\nThe issue seems to arise from the missing of sqlite3.dll in path \".\\\\Anaconda\\\\Dlls\\\\\".\\n✅I solved it by simply copying that .dll file from \\\\Anaconda3\\\\Library\\\\bin and put it under the path mentioned above. (if you are using anaconda)',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'pgAdmin - Can not access/open the PgAdmin address via browser'},\n",
       "  {'text': 'If you follow the video 1.2.2 - Ingesting NY Taxi Data to Postgres and you execute all the same\\nsteps as Alexey does, you will ingest all the data (~1.3 million rows) into the table yellow_taxi_data as expected.\\nHowever, if you try to run the whole script in the Jupyter notebook for a second time from top to bottom, you will be missing the first chunk of 100000 records. This is because there is a call to the iterator before the while loop that puts the data in the table. The while loop therefore starts by ingesting the second chunk, not the first.\\n✅Solution: remove the cell “df=next(df_iter)” that appears higher up in the notebook than the while loop. The first time w(df_iter) is called should be within the while loop.\\n📔Note: As this notebook is just used as a way to test the code, it was not intended to be run top to bottom, and the logic is tidied up in a later step when it is instead inserted into a .py file for the pipeline',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Python - Ingestion with Jupyter notebook - missing 100000 records'},\n",
       "  {'text': '{t_end - t_start} seconds\")\\nimport pandas as pd\\ndf = pd.read_csv(\\'path/to/file.csv.gz\\', /app/ingest_data.py:1: DeprecationWarning:)\\nIf you prefer to keep the uncompressed csv (easier preview in vscode and similar), gzip files can be unzipped using gunzip (but not unzip). On a Ubuntu local or virtual machine, you may need to apt-get install gunzip first.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Python - Iteration csv without error'},\n",
       "  {'text': \"Pandas can interpret “string” column values as “datetime” directly when reading the CSV file using “pd.read_csv” using the parameter “parse_dates”, which for example can contain a list of column names or column indices. Then the conversion afterwards is not required anymore.\\npandas.read_csv — pandas 2.1.4 documentation (pydata.org)\\nExample from week 1\\nimport pandas as pd\\ndf = pd.read_csv(\\n'yellow_tripdata_2021-01.csv',\\nnrows=100,\\nparse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])\\ndf.info()\\nwhich will output\\n<class 'pandas.core.frame.DataFrame'>\\nRangeIndex: 100 entries, 0 to 99\\nData columns (total 18 columns):\\n#   Column                 Non-Null Count  Dtype\\n---  ------                 --------------  -----\\n0   VendorID               100 non-null    int64\\n1   tpep_pickup_datetime   100 non-null    datetime64[ns]\\n2   tpep_dropoff_datetime  100 non-null    datetime64[ns]\\n3   passenger_count        100 non-null    int64\\n4   trip_distance          100 non-null    float64\\n5   RatecodeID             100 non-null    int64\\n6   store_and_fwd_flag     100 non-null    object\\n7   PULocationID           100 non-null    int64\\n8   DOLocationID           100 non-null    int64\\n9   payment_type           100 non-null    int64\\n10  fare_amount            100 non-null    float64\\n11  extra                  100 non-null    float64\\n12  mta_tax                100 non-null    float64\\n13  tip_amount             100 non-null    float64\\n14  tolls_amount           100 non-null    float64\\n15  improvement_surcharge  100 non-null    float64\\n16  total_amount           100 non-null    float64\\n17  congestion_surcharge   100 non-null    float64\\ndtypes: datetime64[ns](2), float64(9), int64(6), object(1)\\nmemory usage: 14.2+ KB\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'iPython - Pandas parsing dates with ‘read_csv’'},\n",
       "  {'text': 'os.system(f\"curl -LO {url} -o {csv_name}\")',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Python - Python cant ingest data from the github link provided using curl'},\n",
       "  {'text': 'When a CSV file is compressed using Gzip, it is saved with a \".csv.gz\" file extension. This file type is also known as a Gzip compressed CSV file. When you want to read a Gzip compressed CSV file using Pandas, you can use the read_csv() function, which is specifically designed to read CSV files. The read_csv() function accepts several parameters, including a file path or a file-like object. To read a Gzip compressed CSV file, you can pass the file path of the \".csv.gz\" file as an argument to the read_csv() function.\\nHere is an example of how to read a Gzip compressed CSV file using Pandas:\\ndf = pd.read_csv(\\'file.csv.gz\\'\\n, compression=\\'gzip\\'\\n, low_memory=False\\n)',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Python - Pandas can read *.csv.gzip'},\n",
       "  {'text': \"Contrary to panda’s read_csv method there’s no such easy way to iterate through and set chunksize for parquet files. We can use PyArrow (Apache Arrow Python bindings) to resolve that.\\nimport pyarrow.parquet as pq\\noutput_name = “https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet”\\nparquet_file = pq.ParquetFile(output_name)\\nparquet_size = parquet_file.metadata.num_rows\\nengine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')\\ntable_name=”yellow_taxi_schema”\\n# Clear table if exists\\npq.read_table(output_name).to_pandas().head(n=0).to_sql(name=table_name, con=engine, if_exists='replace')\\n# default (and max) batch size\\nindex = 65536\\nfor i in parquet_file.iter_batches(use_threads=True):\\nt_start = time()\\nprint(f'Ingesting {index} out of {parquet_size} rows ({index / parquet_size:.0%})')\\ni.to_pandas().to_sql(name=table_name, con=engine, if_exists='append')\\nindex += 65536\\nt_end = time()\\nprint(f'\\\\t- it took %.1f seconds' % (t_end - t_start))\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Python - How to iterate through and ingest parquet file'},\n",
       "  {'text': 'Error raised during the jupyter notebook’s cell execution:\\nfrom sqlalchemy import create_engine.\\nSolution: Version of Python module “typing_extensions” >= 4.6.0. Can be updated by Conda or pip.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': \"Python - SQLAlchemy - ImportError: cannot import name 'TypeAliasType' from 'typing_extensions'.\"},\n",
       "  {'text': 'create_engine(\\'postgresql://root:root@localhost:5432/ny_taxi\\')  I get the error \"TypeError: \\'module\\' object is not callable\"\\nSolution:\\nconn_string = \"postgresql+psycopg://root:root@localhost:5432/ny_taxi\"\\nengine = create_engine(conn_string)',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': \"Python - SQLALchemy - TypeError 'module' object is not callable\"},\n",
       "  {'text': \"Error raised during the jupyter notebook’s cell execution:\\nengine = create_engine('postgresql://root:root@localhost:5432/ny_taxi').\\nSolution: Need to install Python module “psycopg2”. Can be installed by Conda or pip.\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': \"Python - SQLAlchemy - ModuleNotFoundError: No module named 'psycopg2'.\"},\n",
       "  {'text': 'Unable to add Google Cloud SDK PATH to Windows\\nWindows error: The installer is unable to automatically update your system PATH. Please add  C:\\\\tools\\\\google-cloud-sdk\\\\bin\\nif you are constantly getting this feedback. Might be that you needed to add Gitbash to your Windows path:\\nOne way of doing that is to use conda: ‘If you are not already using it\\nDownload the Anaconda Navigator\\nMake sure to check the box (add conda to the path when installing navigator: although not recommended do it anyway)\\nYou might also need to install git bash if you are not already using it(or you might need to uninstall it to reinstall it properly)\\nMake sure to check the following boxes while you install Gitbash\\nAdd a GitBash to Windows Terminal\\nUse Git and optional Unix tools from the command prompt\\nNow open up git bash and type conda init bash This should modify your bash profile\\nAdditionally, you might want to use Gitbash as your default terminal.\\nOpen your Windows terminal and go to settings, on the default profile change Windows power shell to git bash',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP - Unable to add Google Cloud SDK PATH to Windows'},\n",
       "  {'text': 'It asked me to create a project. This should be done from the cloud console. So maybe we don’t need this FAQ.\\nWARNING: Project creation failed: HttpError accessing <https://cloudresourcemanager.googleapis.com/v1/projects?alt=json>: response: <{\\'vtpep_pickup_datetimeary\\': \\'Origin, X-Origin, Referer\\', \\'content-type\\': \\'application/json; charset=UTF-8\\', \\'content-encoding\\': \\'gzip\\', \\'date\\': \\'Mon, 24 Jan 2022 19:29:12 GMT\\', \\'server\\': \\'ESF\\', \\'cache-control\\': \\'private\\', \\'x-xss-protection\\': \\'0\\', \\'x-frame-options\\': \\'SAMEORIGIN\\', \\'x-content-type-options\\': \\'nosniff\\', \\'server-timing\\': \\'gfet4t7; dur=189\\', \\'alt-svc\\': \\'h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000,h3-Q050=\":443\"; ma=2592000,h3-Q046=\":443\"; ma=2592000,h3-Q043=\":443\"; ma=2592000,quic=\":443\"; ma=2592000; v=\"46,43\"\\', \\'transfer-encoding\\': \\'chunked\\', \\'status\\': 409}>, content <{\\n\"error\": {\\n\"code\": 409,\\n\"message\": \"Requested entity alreadytpep_pickup_datetime exists\",\\n\"status\": \"ALREADY_EXISTS\"\\n}\\n}\\nFrom Stackoverflow: https://stackoverflow.com/questions/52561383/gcloud-cli-cannot-create-project-the-project-id-you-specified-is-already-in-us?rq=1\\nProject IDs are unique across all projects. That means if any user ever had a project with that ID, you cannot use it. testproject is pretty common, so it\\'s not surprising it\\'s already taken.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP - Project creation failed: HttpError accessing … Requested entity alreadytpep_pickup_datetime exists'},\n",
       "  {'text': 'If you receive the error: “Error 403: The project to be billed is associated with an absent billing account., accountDisabled” It is most likely because you did not enter YOUR project ID. The snip below is from video 1.3.2\\nThe value you enter here will be unique to each student. You can find this value on your GCP Dashboard when you login.\\nAshish Agrawal\\nAnother possibility is that you have not linked your billing account to your current project',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP - The project to be billed is associated with an absent billing account'},\n",
       "  {'text': 'GCP Account Suspension Inquiry\\nIf Google refuses your credit/debit card, try another - I’ve got an issue with Kaspi (Kazakhstan) but it worked with TBC (Georgia).\\nUnfortunately, there’s small hope that support will help.\\nIt seems that Pyypl web-card should work too.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP - OR-CBAT-15 ERROR Google cloud free trial account'},\n",
       "  {'text': 'The ny-rides.json is your private file in Google Cloud Platform (GCP). \\n\\nAnd here’s the way to find it:\\nGCP -> Select project with your  instance -> IAM & Admin -> Service Accounts Keys tab -> add key, JSON as key type, then click create\\nNote: Once you go into Service Accounts Keys tab, click the email, then you can see the “KEYS” tab where you can add key as a JSON as its key type',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP - Where can I find the “ny-rides.json” file?'},\n",
       "  {'text': 'In this lecture, Alexey deleted his instance in Google Cloud. Do I have to do it?\\nNope. Do not delete your instance in Google Cloud platform. Otherwise, you have to do this twice for the week 1 readings.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP - Do I need to delete my instance in Google Cloud?'},\n",
       "  {'text': 'System Resource Usage:\\ntop or htop: Shows real-time information about system resource usage, including CPU, memory, and processes.\\nfree -h: Displays information about system memory usage and availability.\\ndf -h: Shows disk space usage of file systems.\\ndu -h <directory>: Displays disk usage of a specific directory.\\nRunning Processes:\\nps aux: Lists all running processes along with detailed information.\\nNetwork:\\nifconfig or ip addr show: Shows network interface configuration.\\nnetstat -tuln: Displays active network connections and listening ports.\\nHardware Information:\\nlscpu: Displays CPU information.\\nlsblk: Lists block devices (disks and partitions).\\nlshw: Lists hardware configuration.\\nUser and Permissions:\\nwho: Shows who is logged on and their activities.\\nw: Displays information about currently logged-in users and their processes.\\nPackage Management:\\napt list --installed: Lists installed packages (for Ubuntu and Debian-based systems)',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Commands to inspect the health of your VM:'},\n",
       "  {'text': 'if you’ve got the error\\n│ Error: Error updating Dataset \"projects/<your-project-id>/datasets/demo_dataset\": googleapi: Error 403: Billing has not been enabled for this project. Enable billing at https://console.cloud.google.com/billing. The default table expiration time must be less than 60 days, billingNotEnabled\\nbut you’ve set your billing account indeed, then try to disable billing for the project and enable it again. It worked for ME!',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Billing account has not been enabled for this project. But you’ve done it indeed!'},\n",
       "  {'text': 'for windows if you having trouble install SDK try follow these steps on the link, if you getting this error:\\nThese credentials will be used by any library that requests Application Default Credentials (ADC).\\nWARNING:\\nCannot find a quota project to add to ADC. You might receive a \"quota exceeded\" or \"API not enabled\" error. Run $ gcloud auth application-default set-quota-project to add a quota project.\\nFor me:\\nI reinstalled the sdk using unzip file “install.bat”,\\nafter successfully checking gcloud version,\\nrun gcloud init to set up project before\\nyou run gcloud auth application-default login\\nhttps://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp/windows.md\\nGCP VM - I cannot get my Virtual Machine to start because GCP has no resources.\\nClick on your VM\\nCreate an image of your VM\\nOn the page of the image, tell GCP to create a new VM instance via the image\\nOn the settings page, change the location',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP - Windows Google Cloud SDK install issue:gcp'},\n",
       "  {'text': 'The reason this video about the GCP VM exists is that many students had problems configuring their env. You can use your own env if it works for you.\\nAnd the advantage of using your own environment is that if you are working in a Github repo where you can commit, you will be able to commit the changes that you do. In the VM the repo is cloned via HTTPS so it is not possible to directly commit, even if you are the owner of the repo.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP VM - Is it necessary to use a GCP VM? When is it useful?'},\n",
       "  {'text': \"I am trying to create a directory but it won't let me do it\\nUser1@DESKTOP-PD6UM8A MINGW64 /\\n$ mkdir .ssh\\nmkdir: cannot create directory ‘.ssh’: Permission denied\\nYou should do it in your home directory. Should be your home (~)\\nLocal. But it seems you're trying to do it in the root folder (/). Should be your home (~)\\nLink to Video 1.4.1\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP VM - mkdir: cannot create directory ‘.ssh’: Permission denied'},\n",
       "  {'text': \"Failed to save '<file>': Unable to write file 'vscode-remote://ssh-remote+de-zoomcamp/home/<user>/data_engineering_course/week_2/airflow/dags/<file>' (NoPermissions (FileSystemError): Error: EACCES: permission denied, open '/home/<user>/data_engineering_course/week_2/airflow/dags/<file>')\\nYou need to change the owner of the files you are trying to edit via VS Code. You can run the following command to change the ownership.\\nssh\\nsudo chown -R <user> <path to your directory>\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP VM - Error while saving the file in VM via VS Code'},\n",
       "  {'text': 'Question: I connected to my VM perfectly fine last week (ssh) but when I tried again this week, the connection request keeps timing out.\\n✅Answer: Start your VM. Once the VM is running, copy its External IP and paste that into your config file within the ~/.ssh folder.\\ncd ~/.ssh\\ncode config ← this opens the config file in VSCode',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': '. GCP VM - VM connection request timeout'},\n",
       "  {'text': '(reference: https://serverfault.com/questions/953290/google-compute-engine-ssh-connect-to-host-ip-port-22-operation-timed-out)Go to edit your VM.\\nGo to section Automation\\nAdd Startup script\\n```\\n#!/bin/bash\\nsudo ufw allow ssh\\n```\\nStop and Start VM.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP VM -  connect to host port 22 no route to host'},\n",
       "  {'text': 'You can easily forward the ports of pgAdmin, postgres and Jupyter Notebook using the built-in tools in Ubuntu and without any additional client:\\nFirst, in the VM machine, launch docker-compose up -d and jupyter notebook in the correct folder.\\nFrom the local machine, execute: ssh -i ~/.ssh/gcp -L 5432:localhost:5432 username@external_ip_of_vm\\nExecute the same command but with ports 8080 and 8888.\\nNow you can access pgAdmin on local machine in browser typing localhost:8080\\nFor Jupyter Notebook, type localhost:8888 in the browser of your local machine. If you have problems with the credentials, it is possible that you have to copy the link with the access token provided in the logs of the terminal of the VM machine when you launched the jupyter notebook command.\\nTo forward both pgAdmin and postgres use, ssh -i ~/.ssh/gcp -L 5432:localhost:5432 -L 8080:localhost:8080 modito@35.197.218.128',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP VM - Port forwarding from GCP without using VS Code'},\n",
       "  {'text': 'If you are using MS VS Code and running gcloud in WSL2, when you first try to login to gcp via the gcloud cli gcloud auth application-default login, you will see a message like this, and nothing will happen\\nAnd there might be a prompt to ask if you want to open it via browser, if you click on it, it will open up a page with error message\\nSolution : you should instead hover on the long link, and ctrl + click the long link\\n\\nClick configure Trusted Domains here\\n\\nPopup will appear, pick first or second entry\\nNext time you gcloud auth, the login page should popup via default browser without issues',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'GCP gcloud + MS VS Code - gcloud auth hangs'},\n",
       "  {'text': 'It is an internet connectivity error, terraform is somehow not able to access the online registry. Check your VPN/Firewall settings (or just clear cookies or restart your network). Try terraform init again after this, it should work.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Terraform - Error: Failed to query available provider packages │ Could not retrieve the list of available versions for provider hashicorp/google: could not query │ provider registry for registry.terrafogorm.io/hashicorp/google: the request failed after 2 attempts, │ please try again later'},\n",
       "  {'text': \"The issue was with the network. Google is not accessible in my country, I am using a VPN. And The terminal program does not automatically follow the system proxy and requires separate proxy configuration settings.I opened a Enhanced Mode in Clash, which is a VPN app, and 'terraform apply' works! So if you encounter the same issue, you can ask help for your vpn provider.\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Terraform - Error:Post \"https://storage.googleapis.com/storage/v1/b?alt=json&prettyPrint=false&project=coherent-ascent-379901\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp 172.217.163.42:443: i/o timeout'},\n",
       "  {'text': 'https://techcommunity.microsoft.com/t5/azure-developer-community-blog/configuring-terraform-on-windows-10-linux-sub-system/ba-p/393845',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Terraform - Install for WSL'},\n",
       "  {'text': 'https://github.com/hashicorp/terraform/issues/14513',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Terraform - Error acquiring the state lock'},\n",
       "  {'text': 'When running\\nterraform apply\\non wsl2 I\\'ve got this error:\\n│ Error: Post \"https://storage.googleapis.com/storage/v1/b?alt=json&prettyPrint=false&project=<your-project-id>\": oauth2: cannot fetch token: 400 Bad Request\\n│ Response: {\"error\":\"invalid_grant\",\"error_description\":\"Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.\"}\\nIT happens because there may be time desync on your machine which affects computing JWT\\nTo fix this, run the command\\nsudo hwclock -s\\nwhich fixes your system time.\\nReference',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Terraform - Error 400 Bad Request.  Invalid JWT Token  on WSL.'},\n",
       "  {'text': '│ Error: googleapi: Error 403: Access denied., forbidden\\nYour $GOOGLE_APPLICATION_CREDENTIALS might not be pointing to the correct file \\nrun = export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/YOUR_JSON.json\\nAnd then = gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Terraform - Error 403 : Access denied'},\n",
       "  {'text': \"One service account is enough for all the services/resources you'll use in this course. After you get the file with your credentials and set your environment variable, you should be good to go.\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Terraform - Do I need to make another service account for terraform before I get the keys (.json file)?'},\n",
       "  {'text': 'Here: https://releases.hashicorp.com/terraform/1.1.3/terraform_1.1.3_linux_amd64.zip',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Terraform - Where can I find the Terraform 1.1.3 Linux (AMD 64)?'},\n",
       "  {'text': 'You get this error because I run the command terraform init outside the working directory, and this is wrong.You need first to navigate to the working directory that contains terraform configuration files, and and then run the command.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Terraform - Terraform initialized in an empty directory! The directory has no Terraform configuration files. You may begin working with Terraform immediately by creating Terraform configuration files.g'},\n",
       "  {'text': 'The error:\\nError: googleapi: Error 403: Access denied., forbidden\\n│\\nand\\n│ Error: Error creating Dataset: googleapi: Error 403: Request had insufficient authentication scopes.\\nFor this solution make sure to run:\\necho $GOOGLE_APPLICATION_CREDENTIALS\\necho $?\\nSolution:\\nYou have to set again the GOOGLE_APPLICATION_CREDENTIALS as Alexey did in the environment set-up video in week1:\\nexport GOOGLE_APPLICATION_CREDENTIALS=\"<path/to/your/service-account-authkeys>.json',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Terraform - Error creating Dataset: googleapi: Error 403: Request had insufficient authentication scopes'},\n",
       "  {'text': \"The error:\\nError: googleapi: Error 403: terraform-trans-campus@trans-campus-410115.iam.gserviceaccount.com does not have storage.buckets.create access to the Google Cloud project. Permission 'storage.buckets.create' denied on resource (or it may not exist)., forbidden\\nThe solution:\\nYou have to declare the project name as your Project ID, and not your Project name, available on GCP console Dashboard.\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Terraform - Error creating Bucket: googleapi: Error 403: Permission denied to access ‘storage.buckets.create’'},\n",
       "  {'text': 'provider \"google\" {\\nproject     = var.projectId\\ncredentials = file(\"${var.gcpkey}\")\\n#region      = var.region\\nzone = var.zone\\n}',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'To ensure the sensitivity of the credentials file, I had to spend lot of time to input that as a file.'},\n",
       "  {'text': 'For the HW1 I encountered this issue. The solution is\\nSELECT * FROM zones AS z WHERE z.\"Zone\" = \\'Astoria Zone\\';\\nI think columns which start with uppercase need to go between “Column”. I ran into a lot of issues like this and “ ” made it work out.\\nAddition to the above point, for me, there is no ‘Astoria Zone’, only ‘Astoria’ is existing in the dataset.\\nSELECT * FROM zones AS z WHERE z.\"Zone\" = \\'Astoria’;',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': \"SQL - SELECT * FROM zones_taxi WHERE Zone='Astoria Zone'; Error Column Zone doesn't exist\"},\n",
       "  {'text': 'It is inconvenient to use quotation marks all the time, so it is better to put the data to the database all in lowercase, so in Pandas after\\ndf = pd.read_csv(‘taxi+_zone_lookup.csv’)\\nAdd the row:\\ndf.columns = df.columns.str.lower()',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': \"SQL - SELECT Zone FROM taxi_zones Error Column Zone doesn't exist\"},\n",
       "  {'text': 'Solution (for mac users): os.system(f\"curl {url} --output {csv_name}\")',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'CURL - curl: (6) Could not resolve host: output.csv'},\n",
       "  {'text': 'To resolve this, ensure that your config file is in C/User/Username/.ssh/config',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'SSH Error: ssh: Could not resolve hostname linux: Name or service not known'},\n",
       "  {'text': 'If you use Anaconda (recommended for the course), it comes with pip, so the issues is probably that the anaconda’s Python is not on the PATH.\\nAdding it to the PATH is different for each operation system.\\nFor Linux and MacOS:\\nOpen a terminal.\\nFind the path to your Anaconda installation. This is typically `~/anaconda3` or `~/opt/anaconda3`.\\nAdd Anaconda to your PATH with the command: `export PATH=\"/path/to/anaconda3/bin:$PATH\"`.\\nTo make this change permanent, add the command to your `.bashrc` (Linux) or `.bash_profile` (MacOS) file.\\nOn Windows, python and pip are in different locations (python is in the anaconda root, and pip is in Scripts). With GitBash:\\nLocate your Anaconda installation. The default path is usually `C:\\\\Users\\\\[YourUsername]\\\\Anaconda3`.\\nDetermine the correct path format for Git Bash. Paths in Git Bash follow the Unix-style, so convert the Windows path to a Unix-style path. For example, `C:\\\\Users\\\\[YourUsername]\\\\Anaconda3` becomes `/c/Users/[YourUsername]/Anaconda3`.\\nAdd Anaconda to your PATH with the command: `export PATH=\"/c/Users/[YourUsername]/Anaconda3/:/c/Users/[YourUsername]/Anaconda3/Scripts/$PATH\"`.\\nTo make this change permanent, add the command to your `.bashrc` file in your home directory.\\nRefresh your environment with the command: `source ~/.bashrc`.\\nFor Windows (without Git Bash):\\nRight-click on \\'This PC\\' or \\'My Computer\\' and select \\'Properties\\'.\\nClick on \\'Advanced system settings\\'.\\nIn the System Properties window, click on \\'Environment Variables\\'.\\nIn the Environment Variables window, select the \\'Path\\' variable in the \\'System variables\\' section and click \\'Edit\\'.\\nIn the Edit Environment Variable window, click \\'New\\' and add the path to your Anaconda installation (typically `C:\\\\Users\\\\[YourUsername]\\\\Anaconda3` and C:\\\\Users\\\\[YourUsername]\\\\Anaconda3\\\\Scripts`).\\nClick \\'OK\\' in all windows to apply the 
changes.\\nAfter adding Anaconda to the PATH, you should be able to use `pip` from the command line. Remember to restart your terminal (or command prompt in Windows) to apply these changes.',\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': \"'pip' is not recognized as an internal or external command, operable program or batch file.\"},\n",
       "  {'text': \"Resolution: You need to stop the services which is using the port.\\nRun the following:\\n```\\nsudo kill -9 `sudo lsof -t -i:<port>`\\n```\\n<port> being 8080 in this case. This will free up the port for use.\\n~ Abhijit Chakraborty\\nError: error response from daemon: cannot stop container: 1afaf8f7d52277318b71eef8f7a7f238c777045e769dd832426219d6c4b8dfb4: permission denied\\nResolution: In my case, I had to stop docker and restart the service to get it running properly\\nUse the following command:\\n```\\nsudo systemctl restart docker.socket docker.service\\n```\\n~ Abhijit Chakraborty\\nError: cannot import module psycopg2\\nResolution: Run the following command in linux:\\n```\\nsudo apt-get install libpq-dev\\npip install psycopg2\\n```\\n~ Abhijit Chakraborty\\nError: docker build Error checking context: 'can't stat '<path-to-file>'\\nResolution: This happens due to insufficient permission for docker to access a certain file within the directory which hosts the Dockerfile.\\n1. You can create a .dockerignore file and add the directory/file which you want Dockerfile to ignore while build.\\n2. If the above does not work, then put the dockerfile and corresponding script, `\\t1.py` in our case to a subfolder. and run `docker build ...`\\nfrom inside the new folder.\\n~ Abhijit Chakraborty\",\n",
       "   'section': 'Module 1: Docker and Terraform',\n",
       "   'question': 'Error: error starting userland proxy: listen tcp4 0.0.0.0:8080: bind: address already in use'},\n",
       "  {'text': 'To get a pip-friendly requirements.txt file file from Anaconda use\\nconda install pip then `pip list –format=freeze > requirements.txt`.\\n`conda list -d > requirements.txt` will not work and `pip freeze > requirements.txt` may give odd pathing.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Anaconda to PIP'},\n",
       "  {'text': 'Prefect: https://docs.google.com/document/d/1K_LJ9RhAORQk3z4Qf_tfGQCDbu8wUWzru62IUscgiGU/edit?usp=sharing\\nAirflow: https://docs.google.com/document/d/1-BwPAsyDH_mAsn8HH5z_eNYVyBMAtawJRjHHsjEKHyY/edit?usp=sharing',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Where are the FAQ questions from the previous cohorts for the orchestration module?'},\n",
       "  {'text': 'Issue : Docker containers exit instantly with code 132, upon docker compose up\\nMage documentation has it listing the cause as \"older architecture\" .\\nThis might be a hardware issue, so unless you have another computer, you can\\'t solve it without purchasing a new one, so the next best solution is a VM.\\nThis is from a student running on a VirtualBox VM, Ubuntu 22.04.3 LTS, Docker version 25.0.2. So not having the context on how the vbox was spin up with (CPU, RAM, network, etc), it’s really inconclusive at this time.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Docker - 2.2.2 Configure Mage'},\n",
       "  {'text': 'This issue was occurring with Windows WSL 2\\nFor me this was because WSL 2 was not dedicating enough cpu cores to Docker.The load seems to take up at least one cpu core so I recommend dedicating at least two.\\nOpen Bash and run the following code:\\n$ cd ~\\n$ ls -la\\nLook for the .wsl config file:\\n-rw-r--r-- 1 ~1049089       31 Jan 25 12:54  .wslconfig\\nUsing a text editing tool of your choice edit or create your .wslconfig file:\\n$ nano .wslconfig\\nPaste the following into the new file/ edit the existing file in this format and save:\\n*** Note - for memory– this is the RAM on your machine you can dedicate to Docker, your situation may be different than mine ***\\n[wsl2]\\nprocessors=<Number of Processors - at least 2!> example: 4\\nmemory=<memory> example:4GB\\nExample:\\nOnce you do that run:\\n$ wsl --shutdown\\nThis shuts down WSL\\nThen Restart Docker Desktop - You should now be able to load the .csv.gz file without the error into a pandas dataframe',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'WSL - 2.2.3 Mage - Unexpected Kernel Restarts; Kernel Running out of memory:'},\n",
       "  {'text': 'The issue and solution on the link:\\nhttps://datatalks-club.slack.com/archives/C01FABYF2RG/p1706817366764269?thread_ts=1706815324.993529&cid=C01FABYF2RG',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': '2.2.3 Configuring Postgres'},\n",
       "  {'text': 'Check that the POSTGRES_PORT variable in the io_config.yml  file is set to port 5432, which is the default postgres port. The POSTGRES_PORT variable is the mage container port, not the host port. Hence, there’s no need to set the POSTGRES_PORT to 5431 just because you already have a conflicting postgres installation in your host machine.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'MAGE - 2.2.3 OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5431 failed: Connection refused'},\n",
       "  {'text': 'You forgot to select ‘dev’ profile in the dropdown menu next to where you select ‘PostgreSQL’ in the connection drop down.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'MAGE - 2.2.4 executing SELECT 1; results in KeyError'},\n",
       "  {'text': 'If you are getting this error. Update your mage io_config.yaml file, and specify a timeout value set to 600 like this.\\nMake sure to save your changes.\\nMAGE - 2.2.4 Testing BigQuery connection using SQL 404 error:\\nNotFound: 404 Not found: Dataset ny-rides-diegogutierrez:None was not found in location northamerica-northeast1\\nIf you get this error even with all roles/permissions given to the service account check if you have ticked the box where it says “Use raw SQL”, just like the image below.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': \"MAGE -2.2.4 ConnectionError: ('Connection aborted.', TimeoutError('The write operation timed out'))\"},\n",
       "  {'text': 'Solution: https://stackoverflow.com/questions/48056381/google-client-invalid-jwt-token-must-be-a-short-lived-token',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': \"Problem: RefreshError: ('invalid_grant: Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.', {'error': 'invalid_grant', 'error_description': 'Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.'})\"},\n",
       "  {'text': \"Origin of Solution (Mage Slack-Channel): https://mageai.slack.com/archives/C03HTTWFEKE/p1706543947795599\\nProblem: This error can often be seen after solving the error mentioned in 2.2.4. The error can be found in Mage version 0.9.61 and is a side-effect of the update of the code for data-loader blocks.\\nNote: Mage 0.9.62 has been released, as of Feb 5 2024. Please recheck. Solution below may be obsolete\\nSolution: Using a “fixed” version of the docker container\\nPull updated docker image from docker-hub\\nmageai/mageaidocker pull:alpha\\nUpdate docker-compose.yaml\\nversion: '3'\\nservices:\\nmagic:\\nimage: mageai/mageai:alpha  <--- instead of “latest”-tag\\ndocker-compose up\\nThe original Error is still present, but the SQL-query will return the desired result:\\n--------------------------------------------------------------------------------------\",\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Mage - 2.2.4 IndexError: list index out of range'},\n",
       "  {'text': 'Add\\nif not path.parent.is_dir():\\npath.parent.mkdir(parents=True)\\npath = Path(path).as_posix()\\nsee:\\nhttps://datatalks-club.slack.com/archives/C01FABYF2RG/p1675774214591809?thread_ts=1675768839.028879&cid=C01FABYF2RG',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': '2.2.6 OSError: Cannot save file into a non-existent directory: \\'..\\\\\\\\..\\\\\\\\data\\\\\\\\yellow\\'\\\\n\")'},\n",
       "  {'text': 'The video DE Zoomcamp 2.2.7 is missing  the actual deployment of Mage using Terraform to GCP. The steps for the deployment were not covered in the video.\\nI successfully deployed it and wanted to share some key points:\\nIn variables.tf, set the project_id default value to your GCP project ID.\\nEnable the Cloud Filestore API:\\nVisit the Google Cloud Console.to\\nNavigate to \"APIs & Services\" > \"Library.\"\\nSearch for \"Cloud Filestore API.\"\\nClick on the API and enable it.\\nTo perform the deployment:\\nterraform init\\nterraform apply\\nPlease note that during the terraform apply step, Terraform will prompt you to enter the PostgreSQL password. After that, it will ask for confirmation to proceed with the deployment. Review the changes, type \\'yes\\' when prompted, and press Enter.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'GCP - 2.2.7d Deploying Mage to GCP'},\n",
       "  {'text': 'If you want to rune multiple docker containers from different directories. Then make sure to change the port mappings in the docker-compose.yml file.\\nports:\\n- 8088:6789\\nThe 8088 port in above case is hostport, where mage will run on your local machine. You can customize this as long as the port is available. If you are running on VM, make sure to forward the port too. You need to keep the container port to 6789 as this is the port where mage is running.\\nGCP - 2.2.7d Deploying Mage to Google Cloud\\nWhile terraforming all the resources inside a VM created in GCS the following error is shown.\\nError log:\\nmodule.lb-http.google_compute_backend_service.default[\"default\"]: Creating...\\n╷\\n│ Error: Error creating GlobalAddress: googleapi: Error 403: Request had insufficient authentication scopes.\\n│ Details:\\n│ [\\n│   {\\n│     \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\\n│     \"domain\": \"googleapis.com\",\\n│     \"metadatas\": {\\n│       \"method\": \"compute.beta.GlobalAddressesService.Insert\",\\n│       \"service\": \"compute.googleapis.com\"\\n│     },\\n│     \"reason\": \"ACCESS_TOKEN_SCOPE_INSUFFICIENT\"\\n│   }\\n│ ]\\n│\\n│ More details:\\n│ Reason: insufficientPermissions, Message: Insufficient Permission\\nThis error might happen when you are using a VM inside GCS. To use the Google APIs from a GCP virtual machine you need to add the cloud platform scope (\"https://www.googleapis.com/auth/cloud-platform\") to your VM when it is created.\\nSince ours is already created you can just stop it and change the permissions. You can do it in the console, just go to \"EDIT\", g99o all the way down until you find \"Cloud API access scopes\". There you can \"Allow full access to all Cloud APIs\". I did this and all went smoothly generating all the resources needed. 
Hope it helps if you encounter this same error.\\nResources: https://stackoverflow.com/questions/35928534/403-request-had-insufficient-authentication-scopes-during-gcloud-container-clu',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Ruuning Multiple Mage instances in Docker from different directories'},\n",
       "  {'text': 'If you are on the free trial account on GCP you will face this issue when trying to deploy the infrastructures with terraform. This service is not available for this kind of account.\\nThe solution I found was to delete the load_balancer.tf file and to comment or delete the rows that differentiate it on the main.tf file. After this just do terraform destroy to delete any infrastructure created on the fail attempts and re-run the terraform apply.\\nCode on main.tf to comment/delete:\\nLine 166, 167, 168',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'GCP - 2.2.7d Load Balancer Problem (Security Policies quota)'},\n",
       "  {'text': \"If you get the following error\\nYou have to edit variables.tf on the gcp folder, set your project-id and region and zones properly. Then, run terraform apply again.\\nYou can find correct regions/zones here: https://cloud.google.com/compute/docs/regions-zones\\nDeploying MAGE to GCP  with Terraform via the VM (2.2.7)\\nFYI - It can take up to 20 minutes to deploy the MAGE Terraform files if you are using a GCP Virtual Machine. It is normal, so don’t interrupt the process or think it’s taking too long. If you have, make sure you run a terraform destroy before trying again as you will have likely partially created resources which will cause errors next time you run `terraform apply`.\\n`terraform destroy` may not completely delete partial resources - go to Google Cloud Console and use the search bar at the top to search for the ‘app.name’ you declared in your variables.tf file; this will list all resources with that name - make sure you delete them all before running `terraform apply` again.\\nWhy are my GCP free credits going so fast? MAGE .tf files - Terraform Destroy not destroying all Resources\\nI checked my GCP billing last night & the MAGE Terraform IaC didn't destroy a GCP Resource called Filestore as ‘mage-data-prep- it has been costing £5.01 of my free credits each day  I now have £151 left - Alexey has assured me that This amount WILL BE SUFFICIENT funds to finish the course. Note to anyone who had issues deploying the MAGE terraform code: check your billing account to see what you're being charged for (main menu - billing) (even if it's your free credits) and run a search for 'mage-data-prep' in the top bar just to be sure that your resources have been destroyed - if any come up delete them.\",\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'GCP - 2.2.7d Part 2 - Getting error when you run terraform apply'},\n",
       "  {'text': '```\\n│ Error: Error creating Connector: googleapi: Error 403: Permission \\'vpcaccess.connectors.create\\' denied on resource \\'//vpcaccess.googleapis.com/projects/<ommit>/locations/us-west1\\' (or it may not exist).\\n│ Details:\\n│ [\\n│   {\\n│     \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\\n│     \"domain\": \"vpcaccess.googleapis.com\",\\n│     \"metadata\": {\\n│       \"permission\": \"vpcaccess.connectors.create\",\\n│       \"resource\": \"projects/<ommit>/locations/us-west1\"\\n│     },\\n│     \"reason\": \"IAM_PERMISSION_DENIED\"\\n│   }\\n│ ]\\n│\\n│   with google_vpc_access_connector.connector,\\n│   on fs.tf line 19, in resource \"google_vpc_access_connector\" \"connector\":\\n│   19: resource \"google_vpc_access_connector\" \"connector\" {\\n│\\n```\\nSolution: Add Serverless VPC Access Admin to Service Account.\\nLine 148',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': \"Question: Permission 'vpcaccess.connectors.create'\"},\n",
       "  {'text': 'Git won’t push an empty folder to GitHub, so if you put a file in that folder and then push, then you should be good to go.\\nOr - in your code- make the folder if it doesn’t exist using Pathlib as shown here: https://stackoverflow.com/a/273227/4590385.\\nFor some reason, when using github storage, the relative path for writing locally no longer works. Try using two separate paths, one full path for the local write, and the original relative path for GCS bucket upload.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': \"File Path: Cannot save file into a non-existent directory: 'data/green'\"},\n",
       "  {'text': 'The green dataset contains lpep_pickup_datetime while the yellow contains tpep_pickup_datetime. Modify the script(s) depending on  the dataset as required.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'No column name lpep_pickup_datetime / tpep_pickup_datetime'},\n",
       "  {'text': 'pd.read_csv\\ndf_iter = pd.read_csv(dataset_url, iterator=True, chunksize=100000)\\nThe data needs to be appended to the parquet file using the fastparquet engine\\ndf.to_parquet(path, compression=\"gzip\", engine=\\'fastparquet\\', append=True)',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Process to download the VSC using Pandas is killed right away'},\n",
       "  {'text': 'denied: requested access to the resource is denied\\nThis can happen when you\\nHaven\\'t logged in properly to Docker Desktop (use docker login -u \"myusername\")\\nHave used the wrong username when pushing to docker images. Use the same one as your username and as the one you build on\\ndocker image build -t <myusername>/<imagename>:<tag>\\ndocker image push <myusername>/<imagename>:<tag>',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Push to docker image failure'},\n",
       "  {'text': \"16:21:35.607 | INFO    | Flow run 'singing-malkoha' - Executing 'write_bq-b366772c-0' immediately...\\nKilled\\nSolution:  You probably are running out of memory on your VM and need to add more.  For example, if you have 8 gigs of RAM on your VM, you may want to expand that to 16 gigs.\",\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Flow script fails with “killed” message:'},\n",
       "  {'text': 'After playing around with prefect for a while this can happen.\\nSsh to your VM and run sudo du -h --block-size=G | sort -n -r | head -n 30 to see which directory needs the most space.\\nMost likely it will be …/.prefect/storage, where your cached flows are stored. You can delete older flows from there. You also have to delete the corresponding flow in the UI, otherwise it will throw you an error, when you try to run your next flow.\\nSSL Certificate Verify: (I got it when trying to run flows on MAC): urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]\\npip install certifi\\n/Applications/Python\\\\ {ver}/Install\\\\ Certificates.command\\nor\\nrunning the “Install Certificate.command” inside of the python{ver} folder',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'GCP VM: Disk Space is full'},\n",
       "  {'text': 'It means your container consumed all available RAM allocated to it. It can happen in particular when working on Question#3 in the homework as the dataset is relatively large and containers eat a lot of memory in general.\\nI would recommend restarting your computer and only starting the necessary processes to run the container. If that doesn’t work, allocate more resources to docker. If also that doesn’t work because your workstation is a potato, you can use an online compute environment service like GitPod, which is free under under 50 hours / month of use.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Docker: container crashed with status code 137.'},\n",
       "  {'text': 'In Q3 there was a task to run the etl script from web to GCS. The problem was, it wasn’t really an ETL straight from web to GCS, but it was actually a web to local storage to local memory to GCS over network ETL. Yellow data is about 100 MB each per month compressed and ~700 MB after uncompressed on memory\\nThis leads to a problem where i either got a network type error because my not so good 3rd world internet or i got my WSL2 crashed/hanged because out of memory error and/or 100% resource usage hang.\\nSolution:\\nif you have a lot of time at hand, try compressing it to parquet and writing it to GCS with the timeout argument set to a really high number (the default os 60 seconds)\\nthe yellow taxi data for feb 2019 is about 100MB as parquet file\\ngcp_cloud_storage_bucket_block.upload_from_path(\\nfrom_path=f\"{path}\",\\nto_path=path,\\ntimeout=600\\n)',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Timeout due to slow upload internet'},\n",
       "  {'text': 'This error occurs when you try to re-run the export block, of the transformed green_taxi data to PostgreSQL.\\nWhat you’ll need to do is to drop the table using SQL in Mage (screenshot below).\\nYou should be able to re-run the block successfully after dropping the table.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'UndefinedColumn: column \"ratecode_id\", \"rate_code_id\" “vendor_id”, “pu_location_id”, “do_location_id” of relation \"green_taxi\" does not exist - Export transformed green_taxi data to PostgreSQL'},\n",
       "  {'text': 'SettingWithCopyWarning:\\nA value is trying to be set on a copy of a slice from a DataFrame.\\nUse the data.loc[] = value syntax instead of df[] = value to ensure that the new column is being assigned to the original dataframe instead of a copy of a dataframe or a series.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Homework - Q3 SettingWithCopyWarning Error:'},\n",
       "  {'text': 'CSV Files are very big in nyc data, so we instead of using Pandas/Python kernel , we can try Pyspark Kernel\\nDocumentation of Mage for using pyspark kernel: https://docs.mage.ai/integrations/spark-pyspark\\n?',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Since I was using slow laptop, and we have so big csv files, I used pyspark kernel in mage instead of python, How to do it?'},\n",
       "  {'text': 'So we will first delete the connection between blocks then we can remove the connection.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'I got an error when I was deleting  BLOCK IN A PIPELINE'},\n",
       "  {'text': 'While Editing the Pipeline Name It throws permission denied error.\\n(Work around)In that case proceed with the work and save later on revisit it will let you edit.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Mage UI won’t let you edit the Pipeline name?'},\n",
       "  {'text': 'Solution n°1 if you want to download everything :\\n```\\nimport pyarrow as pa\\nimport pyarrow.parquet as pq\\nfrom pyarrow.fs import GcsFileSystem\\n…\\n@data_loader\\ndef load_data(*args, **kwargs):\\n    bucket_name = YOUR_BUCKET_NAME_HERE\\'\\n    blob_prefix = \\'PATH / TO / WHERE / THE / PARTITIONS / ARE\\'\\n    root_path = f\"{bucket_name}/{blob_prefix}\"\\npa_table = pq.read_table(\\n        source=root_path,\\n        filesystem=GcsFileSystem(),        \\n    )\\n\\n    return pa_table.to_pandas()\\nSolution n°2 if you want to download only some dates :\\n@data_loader\\ndef load_data(*args, **kwargs):\\ngcs = pa.fs.GcsFileSystem()\\nbucket_name = \\'YOUR_BUCKET_NAME_HERE\\'\\nblob_prefix = \\'\\'PATH / TO / WHERE / THE / PARTITIONS / ARE\\'\\'\\nroot_path = f\"{bucket_name}/{blob_prefix}\"\\npa_dataset = pq.ParquetDataset(\\npath_or_paths=root_path,\\nfilesystem=gcs,\\nfilters=[(\\'lpep_pickup_date\\', \\'>=\\', \\'2020-10-01\\'), (\\'lpep_pickup_date\\', \\'<=\\', \\'2020-10-31\\')]\\n)\\nreturn pa_dataset.read().to_pandas()\\n# More information about the pq.Parquet.Dataset : Encapsulates details of reading a complete Parquet dataset possibly consisting of multiple files and partitions in subdirectories. Documentation here :\\nhttps://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset\\nERROR: UndefinedColumn: column \"vendor_id\" of relation \"green_taxi\" does not exist\\nTwo possible solutions both of them work in the same way.\\nOpen up a Data Loader connect using SQL - RUN the command \\n`DROP TABLE mage.green_taxi`\\nElse, Open up a Data Extractor of SQL  - increase the rows to above the number of rows in the dataframe (you can find that in the bottom of the transformer block) change the Write Policy to `Replace` and run the SELECT statement',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'How do I make Mage load the partitioned files that we created on 2.2.4, to load them into BigQuery ?'},\n",
       "  {'text': \"All mage files are in your /home/src/folder where you saved your credentials.json so you should be able to access them locally. You will see a folder for ‘Pipelines’,  'data loaders', 'data transformers' & 'data exporters' - inside these will be the .py or .sql files for the blocks you created in your pipeline.\\nRight click & ‘download’ the pipeline itself to your local machine (which gives you metadata, pycache and other files)\\nAs above, download each .py/.sql file that corresponds to each block you created for the pipeline. You'll find these under 'data loaders', 'data transformers' 'data exporters'\\nMove the downloaded files to your GitHub repo folder & commit your changes.\",\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Git - What Files Should I Submit for Homework 2 & How do I get them out of MAGE:'},\n",
       "  {'text': 'Assuming you downloaded the Mage repo in the week 2 folder of the Data Engineering Zoomcamp, you might want to include your mage copy, demo pipelines and homework within your personal copy of the Data Engineering Zoomcamp repo. This will not work by default, because GitHub sees them as two separate repositories, and one does not track the other. To add the Mage files to your main DE Zoomcamp repo, you will need to:\\nMove the contents of the .gitignore file in your main .gitignore.\\nUse the terminal to cd into the Mage folder and:\\nrun “git remote remove origin” to de-couple the Mage repo,\\nrun “rm -rf .git” to delete local git files,\\nrun “git add .” to add the current folder as changes to stage, commit and push.',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Git - How do I include the files in the Mage repo (including exercise files and homework) in a personal copy of the Data Engineering Zoomcamp repo?'},\n",
       "  {'text': \"When try to add three assertions:\\nvendor_id is one of the existing values in the column (currently)\\npassenger_count is greater than 0\\ntrip_distance is greater than 0\\nto test_output, I got ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Below is my code:\\ndata_filter = (data['passenger_count'] > 0) and (data['trip_distance'] > 0)\\nAfter looking for solutions at Stackoverflow, I found great discussion about it. So I changed my code into:\\ndata_filter = (data['passenger_count'] > 0) & (data['trip_distance'] > 0)\",\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Got ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()'},\n",
       "  {'text': 'This happened when I just booted up my PC, continuing from the progress I was doing from yesterday.\\nAfter cd-ing into your directory, and running docker compose up , the web interface for the Mage shows, but the files that I had yesterday was gone.\\nIf your files are gone, go ahead and close the web interface, and properly shutting down the mage docker compose by doing Ctrl + C once. Try running it again. This worked for me more than once (yes the issue persisted with my PC twice)\\nAlso, you should check if you’re in the correct repository before doing docker compose up . This was discussed in the Slack #course-data-engineering channel',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Mage AI Files are Gone/disappearing'},\n",
       "  {'text': 'The above errors due to “ at the trailing side and it need to be modified with ‘ quotes at both ends\\nKrishna Anand',\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Mage - Errors in io.config.yaml file'},\n",
       "  {'text': \"Problem: The following error occurs when attempting to export data from Mage to a GCS bucket using pyarrow suggesting Mage doesn’t have the necessary permissions to access the specified GCP credentials .json file.\\nArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error GetBucketMetadata: Could not create a OAuth2 access token to authenticate the request. The request was not sent, as such an access token is required to complete the request successfully. Learn more about Google Cloud authentication at https://cloud.google.com/docs/authentication. The underlying error message was: Cannot open credentials file /home/src/...\\nSolution: Inside the Mage app:\\nCreate a credentials folder (e.g. gcp-creds) within the magic-zoomcamp folder\\nIn the credentials folder create a .json key file (e.g. mage-gcp-creds.json)\\nCopy/paste GCP service account credentials into the .json key file and save\\nUpdate code to point to this file. E.g.\\nenviron['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/src/magic-zoomcamp/gcp-creds/mage-gcp-creds.json'\",\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Mage - ArrowException Cannot open credentials file'},\n",
       "  {'text': \"Oserror: google::cloud::status(unavailable: retry policy exhausted getbucketmetadata: could not create a OAuth2 access token to authenticate the request. the request was not sent, as such an access token is required to complete the request successfully. learn more about google cloud authentication at https://cloud.google.com/docs/authentication. the underlying error message was: performwork() - curl error [6]=couldn't resolve host name)\",\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Mage - OSError'},\n",
       "  {'text': \"Problem: The following error occurs when attempting to export data from Mage to a GCS bucket. Assigned service account doesn’t have the necessary permissions access Google Cloud Storage Bucket\\nPermissionError: [Errno 13] google::cloud::Status(PERMISSION_DENIED: Permanent error GetBucketMetadata:... .iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist). error_info={reason=forbidden, domain=global, metadata={http_status_code=403}}). Detail: [errno 13] Permission denied\\nSolution: Add Cloud Storage Admin role to the service account:\\nGo to project in Google Cloud Console>IAM & Admin>IAM\\nClick Edit principal (pencil symbol) to the right of the service account you are using\\nClick + ADD ANOTHER ROLE\\nSelect Cloud Storage>Storage Admin\\nClick Save\",\n",
       "   'section': 'Module 2: Workflow Orchestration',\n",
       "   'question': 'Mage - PermissionError service account does not have storage.buckets.get access to the Google Cloud Storage bucket'},\n",
       "  {'text': '1. Make sure your pyspark script is ready to be send to Dataproc cluster\\n2. Create a Dataproc Cluster in GCP Console\\n3. Make sure to edit the service account and add new role - Dataproc Editor\\n4. Copy the python script ./notebooks/pyspark_script.py and place it under GCS bucket path\\n5. Make sure gcloud cli is installed either in Mage manually or  via your Dockerfile and docker-compose files. This is needed to let Mage access google Dataproc and the script it needs to execute. Refer - Installing the latest gcloud CLI\\n6. Use the Bigquery/Dataproc script mentioned here - https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/code/cloud.md . Use Mage to trigger the query',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'Trigger Dataproc from Mage'},\n",
       "  {'text': 'A:\\n1 solution) Add -Y flag, so that apt-get automatically agrees to install additional packages\\n2) Use python ZipFile package, which is included in all modern python distributions',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'Docker-compose takes infinitely long to install zip unzip packages for linux, which are required to unpack datasets'},\n",
       "  {'text': 'Make sure to use Nullable dataTypes, such as Int64 when appliable.',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCS Bucket - error when writing data from web to GCS:'},\n",
       "  {'text': 'Ultimately, when trying to ingest data into a BigQuery table, all files within a given directory must have the same schema.\\nWhen dealing for example with the FHV Datasets from 2019, however (see image below), one can see that the files for \\'2019-05\\', and 2019-06, have the columns \"PUlocationID\" and \"DOlocationID\" as Integers, while for the period of \\'2019-01\\' through \\'2019-04\\', the same column is defined as FLOAT.\\nSo while importing these files as parquet to BigQuery, the first one will be used to define the schema of the table, while all files following that will be used to append data on the existing table. Which means, they must all follow the very same schema of the file that created the table.\\nSo, in order to prevent errors like that, make sure to enforce the data types for the columns on the DataFrame before you serialize/upload them to BigQuery. Like this:\\npd.read_csv(\"path_or_url\").astype({\\n\\t\"col1_name\": \"datatype\",\\t\\n\\t\"col2_name\": \"datatype\",\\t\\n\\t...\\t\\t\\t\\t\\t\\n\\t\"colN_name\": \"datatype\" \\t\\n})',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': \"GCS Bucket - Failed to create table: Error while reading data, error message: Parquet column 'XYZ' has type INT which does not match the target cpp_type DOUBLE. File: gs://path/to/some/blob.parquet\"},\n",
       "  {'text': \"If you receive the error gzip.BadGzipFile: Not a gzipped file (b'\\\\n\\\\n'), this is because you have specified the wrong URL to the FHV dataset. Make sure to use https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/{dataset_file}.csv.gz\\nEmphasising the ‘/releases/download’ part of the URL.\",\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCS Bucket - Fix Error when importing FHV data to GCS'},\n",
       "  {'text': 'Krishna Anand',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCS Bucket - Load Data From URL list in to GCP Bucket'},\n",
       "  {'text': 'Check the Schema\\nYou might have a wrong formatting\\nTry to upload the CSV.GZ files without formatting or going through pandas via wget\\nSee this Slack conversation for helpful tips',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCS Bucket - I query my dataset and get a Bad character (ASCII 0) error?'},\n",
       "  {'text': 'Run the following command to check if “BigQuery Command Line Tool” is installed or not: gcloud components list\\nYou can also use bq.cmd instead of bq to make it work.',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - “bq: command not found”'},\n",
       "  {'text': 'Use big queries carefully,\\nI created by bigquery dataset on an account where my free trial was exhausted, and got a bill of $80.\\nUse big query in free credits and destroy all the datasets after creation.\\nCheck your Billing daily! Especially if you’ve spinned up a VM.',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Caution in using bigquery:no'},\n",
       "  {'text': 'Be careful when you create your resources on GCP, all of them have to share the same Region in order to allow load data from GCS Bucket to BigQuery. If you forgot it when you created them, you can create a new dataset on BigQuery using the same Region which you used on your GCS Bucket.\\nThis means that your GCS Bucket and the BigQuery dataset are placed in different regions. You have to create a new dataset inside BigQuery in the same region with your GCS bucket and store the data in the newly created dataset.',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Cannot read and write in different locations: source: EU, destination: US - Loading data from GCS into BigQuery (different Region):'},\n",
       "  {'text': \"Make sure to create the BigQuery dataset in the very same location that you've created the GCS Bucket. For instance, if your GCS Bucket was created in `us-central1`, then BigQuery dataset must be created in the same region (us-central1, in this example)\",\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Cannot read and write in different locations: source: <REGION_HERE>, destination: <ANOTHER_REGION_HERE>'},\n",
       "  {'text': 'By the way, this isn’t a problem/solution, but a useful hint:\\nPlease, remember to save your progress in BigQuery SQL Editor.\\nI was almost finishing the homework, when my Chrome Tab froze and I had to reload it. Then I lost my entire SQL script.\\nSave your script from time to time. Just click on the button at the top bar. Your saved file will be available on the left panel.\\nAlternatively, you can copy paste your queries into an .sql file in your preferred editor (Notepad++, VS Code, etc.). Using the .sql extension will provide convenient color formatting.',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Remember to save your queries'},\n",
       "  {'text': 'Ans :  While real-time analytics might not be explicitly mentioned, BigQuery has real-time data streaming capabilities, allowing for potential integration in future project iterations.',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Can I use BigQuery for real-time analytics in this project?'},\n",
       "  {'text': \"could not parse 'pickup_datetime' as timestamp for field pickup_datetime (position 2)\\nThis error is caused by invalid data in the timestamp column. A way to identify the problem is to define the schema from the external table using string datatype. This enables the queries to work at which point we can filter out the invalid rows from the import to the materialised table and insert the fields with the timestamp data type.\",\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Unable to load data from external tables into a materialized table in BigQuery due to an invalid timestamp error that are added while appending data to the file in Google Cloud Storage'},\n",
       "  {'text': 'Background:\\n`pd.read_parquet`\\n`pd.to_datetime`\\n`pq.write_to_dataset`\\nReference:\\nhttps://stackoverflow.com/questions/48314880/are-parquet-file-created-with-pyarrow-vs-pyspark-compatible\\nhttps://stackoverflow.com/questions/57798479/editing-parquet-files-with-python-causes-errors-to-datetime-format\\nhttps://www.reddit.com/r/bigquery/comments/16aoq0u/parquet_timestamp_to_bq_coming_across_as_int/?share_id=YXqCs5Jl6hQcw-kg6-VgF&utm_content=1&utm_medium=ios_app&utm_name=ioscss&utm_source=share&utm_term=1\\nSolution:\\nAdd `use_deprecated_int96_timestamps=True` to `pq.write_to_dataset` function, like below\\npq.write_to_dataset(\\ntable,\\nroot_path=root_path,\\nfilesystem=gcs,\\nuse_deprecated_int96_timestamps=True\\n# Write timestamps to INT96 Parquet format\\n)',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Error Message in BigQuery: annotated as a valid Timestamp, please annotate it as TimestampType(MICROS) or TimestampType(MILLIS)'},\n",
       "  {'text': 'Solution:\\nIf you’re using Mage, in the last Data Exporter that writes to Google Cloud Storage use PyArrow to generate the Parquet file with the correct logical type for the datetime columns, otherwise they won\\'t be converted to timestamp when loaded by BigQuery later on.\\nimport pyarrow as pa\\nimport pyarrow.parquet as pq\\nimport os\\nif \\'data_exporter\\' not in globals():\\nfrom mage_ai.data_preparation.decorators import data_exporter\\n# Replace with the location of your service account key JSON file.\\nos.environ[\\'GOOGLE_APPLICATION_CREDENTIALS\\'] = \\'/home/src/personal-gcp.json\\'\\nbucket_name = \"<YOUR_BUCKET_NAME>\"\\nobject_key = \\'nyc_taxi_data_2022.parquet\\'\\nwhere = f\\'{bucket_name}/{object_key}\\'\\n@data_exporter\\ndef export_data(data, *args, **kwargs):\\ntable = pa.Table.from_pandas(data, preserve_index=False)\\ngcs = pa.fs.GcsFileSystem()\\npq.write_table(\\ntable,\\nwhere,\\n# Convert integer columns in Epoch milliseconds\\n# to Timestamp columns in microseconds (\\'us\\') so\\n# they can be loaded into BigQuery with the right\\n# data type\\ncoerce_timestamps=\\'us\\',\\nfilesystem=gcs\\n)\\nSolution 2:\\nIf you’re using Mage, in the last Data Exporter that writes to Google Cloud Storage, provide PyArrow with explicit schema to generate the Parquet file with the correct logical type for the datetime columns, otherwise they won\\'t be converted to timestamp when loaded by BigQuery later on.\\nschema = pa.schema([\\n(\\'vendor_id\\', pa.int64()),\\n(\\'lpep_pickup_datetime\\', pa.timestamp(\\'ns\\')),\\n(\\'lpep_dropoff_datetime\\', pa.timestamp(\\'ns\\')),\\n(\\'store_and_fwd_flag\\', pa.string()),\\n(\\'ratecode_id\\', pa.int64()),\\n(\\'pu_location_id\\', pa.int64()),\\n(\\'do_location_id\\', pa.int64()),\\n(\\'passenger_count\\', pa.int64()),\\n(\\'trip_distance\\', pa.float64()),\\n(\\'fare_amount\\', pa.float64()),\\n(\\'extra\\', pa.float64()),\\n(\\'mta_tax\\', pa.float64()),\\n(\\'tip_amount\\', 
pa.float64()),\\n(\\'tolls_amount\\', pa.float64()),\\n(\\'improvement_surcharge\\', pa.float64()),\\n(\\'total_amount\\', pa.float64()),\\n(\\'payment_type\\', pa.int64()),\\n(\\'trip_type\\', pa.int64()),\\n(\\'congestion_surcharge\\', pa.float64()),\\n(\\'lpep_pickup_month\\', pa.int64())\\n])\\ntable = pa.Table.from_pandas(data, schema=schema)',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Datetime columns in Parquet files created from Pandas show up as integer columns in BigQuery'},\n",
       "  {'text': 'Reference:\\nhttps://cloud.google.com/bigquery/docs/external-data-cloud-storage\\nSolution:\\nfrom google.cloud import bigquery\\n# Set table_id to the ID of the table to create\\ntable_id = f\"{project_id}.{dataset_name}.{table_name}\"\\n# Construct a BigQuery client object\\nclient = bigquery.Client()\\n# Set the external source format of your table\\nexternal_source_format = \"PARQUET\"\\n# Set the source_uris to point to your data in Google Cloud\\nsource_uris = [ f\\'gs://{bucket_name}/{object_key}/*\\']\\n# Create ExternalConfig object with external source format\\nexternal_config = bigquery.ExternalConfig(external_source_format)\\n# Set source_uris that point to your data in Google Cloud\\nexternal_config.source_uris = source_uris\\nexternal_config.autodetect = True\\ntable = bigquery.Table(table_id)\\n# Set the external data configuration of the table\\ntable.external_data_configuration = external_config\\ntable = client.create_table(table)  # Make an API request.\\nprint(f\\'Created table with external source: {table_id}\\')\\nprint(f\\'Format: {table.external_data_configuration.source_format}\\')',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Create External Table using Python'},\n",
       "  {'text': 'Reference:\\nhttps://stackoverflow.com/questions/60941726/can-bigquery-api-overwrite-existing-table-view-with-create-table-tables-inser\\nSolution:\\nCombine with “Create External Table using Python”, use it before “client.create_table” function.\\ndef tableExists(tableID, client):\\n\"\"\"\\nCheck if a table already exists using the tableID.\\nreturn : (Boolean)\\n\"\"\"\\ntry:\\ntable = client.get_table(tableID)\\nreturn True\\nexcept Exception as e: # NotFound:\\nreturn False',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Check BigQuery Table Exist And Delete'},\n",
       "  {'text': 'To avoid this error you can upload data from Google Cloud Storage to BigQuery through BigQuery Cloud Shell using the command:\\n$ bq load  --autodetect --allow_quoted_newlines --source_format=CSV dataset_name.table_name \"gs://dtc-data-lake-bucketname/fhv/fhv_tripdata_2019-*.csv.gz\"',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Error: Missing close double quote (\") character'},\n",
       "  {'text': 'Solution: This problem arises if your gcs and bigquery storage is in different regions.\\nOne potential way to solve it:\\nGo to your google cloud bucket and check the region in field named “Location”\\nNow in bigquery, click on three dot icon near your project name and select create dataset.\\nIn region filed choose the same regions as you saw in your google cloud bucket',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Cannot read and write in different locations: source: asia-south2, destination: US'},\n",
       "  {'text': 'There are multiple benefits of using Cloud Functions to automate tasks in Google Cloud.\\nUse below Cloud Function python script to load files directly to BigQuery. Use your project id, dataset id & table id as defined by you.\\nimport tempfile\\nimport requests\\nimport logging\\nfrom google.cloud import bigquery\\ndef hello_world(request):\\n# table_id = <project_id.dataset_id.table_id>\\ntable_id = \\'de-zoomcap-project.dezoomcamp.fhv-2019\\'\\n# Create a new BigQuery client\\nclient = bigquery.Client()\\nfor month in range(4, 13):\\n# Define the schema for the data in the CSV.gz files\\nurl = \\'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-{:02d}.csv.gz\\'.format(month)\\n# Download the CSV.gz file from Github\\nresponse = requests.get(url)\\n# Create new table if loading first month data else append\\nwrite_disposition_string = \"WRITE_APPEND\" if month > 1 else \"WRITE_TRUNCATE\"\\n# Defining LoadJobConfig with schema of table to prevent it from changing with every table\\njob_config = bigquery.LoadJobConfig(\\nschema=[\\nbigquery.SchemaField(\"dispatching_base_num\", \"STRING\"),\\nbigquery.SchemaField(\"pickup_datetime\", \"TIMESTAMP\"),\\nbigquery.SchemaField(\"dropOff_datetime\", \"TIMESTAMP\"),\\nbigquery.SchemaField(\"PUlocationID\", \"STRING\"),\\nbigquery.SchemaField(\"DOlocationID\", \"STRING\"),\\nbigquery.SchemaField(\"SR_Flag\", \"STRING\"),\\nbigquery.SchemaField(\"Affiliated_base_number\", \"STRING\"),\\n],\\nskip_leading_rows=1,\\nwrite_disposition=write_disposition_string,\\nautodetect=True,\\nsource_format=\"CSV\",\\n)\\n# Load the data into BigQuery\\n# Create a temporary file to prevent the exception- AttributeError: \\'bytes\\' object has no attribute \\'tell\\'\"\\nwith tempfile.NamedTemporaryFile() as f:\\nf.write(response.content)\\nf.seek(0)\\njob = 
client.load_table_from_file(\\nf,\\ntable_id,\\nlocation=\"US\",\\njob_config=job_config,\\n)\\njob.result()\\nlogging.info(\"Data for month %d successfully loaded into table %s.\", month, table_id)\\nreturn \\'Data loaded into table {}.\\'.format(table_id)',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - Tip: Using Cloud Function to read csv.gz files from github directly to BigQuery in Google Cloud:'},\n",
       "  {'text': 'You need to uncheck cache preferences in query settings',\n",
       "   'section': 'Module 3: Data Warehousing',\n",
       "   'question': 'GCP BQ - When querying two different tables external and materialized you get the same result when count(distinct(*))'},\n",
       "  {'text': 'Problem: When you inject data into GCS using Pandas, there is a chance that some dataset has missing values on  DOlocationID and PUlocationID. Pandas by default will cast these columns as float data type, causing inconsistent data type between parquet in GCS and schema defined in big query. You will see something like this:\\nSolution:\\nFix the data type issue in data pipeline\\nBefore injecting data into GCS, use astype and Int64 (which is different from int64 and accept both missing value and integer exist in the column) to cast the columns.\\nSomething like:\\ndf[\"PUlocationID\"] = df.PUlocationID.astype(\"Int64\")\\ndf[\"DOlocationID\"] = df.DOlocationID.astype(\"Int64\")\\nNOTE: It is best to define the data type of all the columns in the Transformation section of the ETL pipeline before loading to BigQuery',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'GCP BQ - How to handle type error from big query and parquet data?'},\n",
       "  {'text': 'Problem occurs when misplacing content after fro``m clause in BigQuery SQLs.\\nCheck to remove any extra apaces or any other symbols, keep in lowercases, digits and dashes only',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'GCP BQ - Invalid project ID . Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project'},\n",
       "  {'text': 'No. Based on the documentation for Bigquery, it does not support more than 1 column to be partitioned.\\n[source]',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'GCP BQ - Does BigQuery support multiple columns partition?'},\n",
       "  {'text': 'Error Message:\\nPARTITION BY expression must be DATE(<timestamp_column>), DATE(<datetime_column>), DATETIME_TRUNC(<datetime_column>, DAY/HOUR/MONTH/YEAR), a DATE column, TIMESTAMP_TRUNC(<timestamp_column>, DAY/HOUR/MONTH/YEAR), DATE_TRUNC(<date_column>, MONTH/YEAR), or RANGE_BUCKET(<int64_column>, GENERATE_ARRAY(<int64_value>, <int64_value>[, <int64_value>]))\\nSolution:\\nConvert the column to datetime first.\\ndf[\"pickup_datetime\"] = pd.to_datetime(df[\"pickup_datetime\"])\\ndf[\"dropOff_datetime\"] = pd.to_datetime(df[\"dropOff_datetime\"])',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'GCP BQ - DATE() Error in BigQuery'},\n",
       "  {'text': 'Native tables are tables where the data is stored in BigQuery.  External tables store the data outside BigQuery, with BigQuery storing metadata about that external table.\\nResources:\\nhttps://cloud.google.com/bigquery/docs/external-tables\\nhttps://cloud.google.com/bigquery/docs/tables-intro',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'GCP BQ - Native tables vs External tables in BigQuery?'},\n",
       "  {'text': 'Issue: Tried running command to export ML model from BQ to GCS from Week 3\\nbq --project_id taxi-rides-ny extract -m nytaxi.tip_model gs://taxi_ml_model/tip_model\\nIt is failing on following error:\\nBigQuery error in extract operation: Error processing job Not found: Dataset was not found in location US\\nI verified the BQ data set and gcs bucket are in the same region- us-west1. Not sure how it gets location US. I couldn’t find the solution yet.\\nSolution:  Please enter correct project_id and gcs_bucket folder address. My gcs_bucket folder address is\\ngs://dtc_data_lake_optimum-airfoil-376815/tip_model',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'GCP BQ ML - Unable to run command (shown in video) to export ML model from BQ to GCS'},\n",
       "  {'text': \"To solve this error mention the location = US when creating the dim_zones table\\n{{ config(\\nmaterialized='table',\\nlocation='US'\\n) }}\\nJust Update this part to solve the issue and run the dim_zones again and then run the fact_trips\",\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'Dim_zones.sql Dataset was not found in location US When Running fact_trips.sql'},\n",
       "  {'text': 'Solution: proceed with setting up serving_dir on your computer as in the extract_model.md file. Then instead of\\ndocker pull tensorflow/serving\\nuse\\ndocker pull emacski/tensorflow-serving\\nThen\\ndocker run -p 8500:8500 -p 8501:8501 --mount type=bind,source=`pwd`/serving_dir/tip_model,target=/models/tip_model -e MODEL_NAME=tip_model -t emacski/tensorflow-serving\\nThen run the curl command as written, and you should get a prediction.',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'GCP BQ ML - Export ML model to make predictions does not work for MacBook with Apple M1 chip (arm architecture).'},\n",
       "  {'text': 'Try deleting data you’ve saved to your VM locally during ETLs\\nKill processes related to deleted files\\nDownload ncdu and look for large files (pay particular attention to files related to Prefect)\\nIf you delete any files related to Prefect, eliminate caching from your flow code',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'VMs - What do I do if my VM runs out of space?'},\n",
       "  {'text': \"Ans: What they mean is that they don't want you to do anything more than that. You should load the files into the bucket and create an external table based on those files (but nothing like cleaning the data and putting it in parquet format)\",\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': \"Homework - What does it mean “Stop with loading the files into a bucket.' Stop with loading the files into a bucket?”\"},\n",
       "  {'text': 'If for whatever reason you try to read parquets directly from nyc.gov’s cloudfront into pandas, you might run into this error:\\npyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds\\nCause:\\nthere is one errant data record where the dropOff_datetime was set to year 3019 instead of 2019.\\npandas uses “timestamp[ns]” (as noted above), and int64 only allows a ~580 year range, centered on 2000. See `pd.Timestamp.max` and `pd.Timestamp.min`\\nThis becomes out of bounds when pandas tries to read it because 3019 > 2300 (approx value of pd.Timestamp.Max\\nFix:\\nUse pyarrow to read it:\\nimport pyarrow.parquet as pq df = pq.read_table(\\'fhv_tripdata_2019-02.parquet\\').to_pandas(safe=False)\\nHowever this results in weird timestamps for the offending record\\nRead the datetime columns separately using pq.read_table\\n\\ntable = pq.read_table(‘taxi.parquet’)\\ndatetimes = [‘list of datetime column names’]\\ndf_dts = pd.DataFrame()\\nfor col in datetimes:\\ndf_dts[col] = pd.to_datetime(table .column(col), errors=\\'coerce\\')\\n\\nThe `errors=’coerce’` parameter will convert the out of bounds timestamps into either the max or the min\\nUse parquet.compute.filter to remove the offending rows\\n\\nimport pyarrow.compute as pc\\ntable = pq.read_table(\"‘taxi.parquet\")\\ndf = table.filter(\\npc.less_equal(table[\"dropOff_datetime\"], pa.scalar(pd.Timestamp.max))\\n).to_pandas()',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'Homework - Reading parquets from nyc.gov directly into pandas returns Out of bounds error'},\n",
       "  {'text': 'Answer: The 2022 NYC taxi data parquet files are available for each month separately. Therefore, you need to add all 12 files to your GCS bucket and then refer to them using the URIs option when creating an external table in BigQuery. You can use the wildcard \"*\" to refer to all 12 files using a single string.',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'Question: for homework 3 , we need all 12 parquet files for green taxi 2022 right ?'},\n",
       "  {'text': 'This can help avoid schema issues in the homework. \\nDownload files locally and use the ‘upload files’ button in GCS at the desired path. You can upload many files at once. You can also choose to upload a folder.',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'Homework - Uploading files to GCS via GUI'},\n",
       "  {'text': 'Ans: Take a careful look at the format of the dates in the question.',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'Homework - Qn 5: The partitioned/clustered table isn’t giving me the prediction I expected'},\n",
       "  {'text': 'Many people aren’t getting an exact match, but are very close to one of the options. As per Alexey said to choose the closest option.',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'Homework - Qn 6: Did anyone get an exact match for one of the options given in Module 3 homework Q6?'},\n",
       "  {'text': 'UnicodeDecodeError: \\'utf-8\\' codec can\\'t decode byte 0xa0 in position 41721: invalid start byte\\nSolution:\\nStep 1: When reading the data from the web into the pandas dataframe mention the encoding as follows:\\npd.read_csv(dataset_url, low_memory=False, encoding=\\'latin1\\')\\nStep 2: When writing the dataframe from the local system to GCS as a csv mention the encoding as follows:\\ndf.to_csv(path_on_gsc, compression=\"gzip\", encoding=\\'utf-8\\')\\nAlternative: use pd.read_parquet(url)',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'Python - invalid start byte Error Message'},\n",
       "  {'text': 'A generator is a function in python that returns an iterator using the yield keyword.\\nA generator is a special type of iterable, similar to a list or a tuple, but with a crucial difference. Instead of creating and storing all the values in memory at once, a generator generates values on-the-fly as you iterate over it. This makes generators memory-efficient, particularly when dealing with large datasets.',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'Python - Generators in python'},\n",
       "  {'text': 'The read_parquet function supports a list of files as an argument. The list of files will be merged into a single result table.',\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'Python - Easiest way to read multiple files at the same time?'},\n",
       "  {'text': \"Incorrect:\\ndf['DOlocationID'] = pd.to_numeric(df['DOlocationID'], downcast=integer) or\\ndf['DOlocationID'] = df['DOlocationID'].astype(int)\\nCorrect:\\ndf['DOlocationID'] = df['DOlocationID'].astype('Int64')\",\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': \"Python - These won't work. You need to make sure you use Int64:\"},\n",
       "  {'text': \"ValueError: Path /Users/kt/.prefect/storage/44ccce0813ed4f24ab2d3783de7a9c3a does not exist.\\nRemove ```cache_key_fn=task_input_hash ``` as it’s in argument in your function & run your flow again.\\nNote: catche key is beneficial if you happen to run the code multiple times, it won't repeat the process which you have finished running in the previous run.  That means, if you have this ```cache_key``` in your initial run, this might cause the error.\",\n",
       "   'section': \"error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.\",\n",
       "   'question': 'Prefect - Error on Running Prefect Flow to Load data to GCS'},\n",
       "  {'text': '@task\\ndef download_file(url: str, file_path: str):\\nresponse = requests.get(url)\\nopen(file_path, \"wb\").write(response.content)\\nreturn file_path\\n@flow\\ndef extract_from_web() -> None:\\nfile_path = download_file(url=f\\'{url-filename}.csv.gz\\',file_path=f\\'{filename}.csv.gz\\')',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Prefect - Tip: Downloading csv.gz from a url in a prefect environment (sample snippet).'},\n",
       "  {'text': 'Update the seed column types in the dbt_project.yaml file\\nfor using double : float\\nfor using int : numeric\\nDBT Cloud production error: prod dataset not available in location EU\\nProblem: I am trying to deploy my DBT  models to production, using DBT Cloud. The data should live in BigQuery.  The dataset location is EU.  However, when I am running the model in production, a prod dataset is being create in BigQuery with a location US and the dbt invoke build is failing giving me \"ERROR 404: porject.dataset:prod not available in location EU\". I tried different ways to fix this. I am not sure if there is a more simple solution then creating my project or buckets in location US. Hope anyone can help here.\\nNote: Everything is working fine in development mode, the issue is just happening when scheduling and running job in production\\nSolution: I created the prod dataset manually in BQ and specified EU, then I ran the job.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'If you are getting not found in location us error.'},\n",
       "  {'text': 'Error: This project does not have a development environment configured. Please create a development environment and configure your development credentials to use the dbt IDE.\\nThe error itself tells us how to solve this issue, the guide is here. And from videos @1:42 and also slack chat',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Setup - No development environment'},\n",
       "  {'text': \"Runtime Error\\ndbt was unable to connect to the specified database.\\nThe database returned the following error:\\n>Database Error\\nAccess Denied: Project <project_name>: User does not have bigquery.jobs.create permission in project <project_name>.\\nCheck your database credentials and try again. For more information, visit:\\nhttps://docs.getdbt.com/docs/configure-your-profile\\nSteps to resolve error in Google Cloud:\\n1. Navigate to IAM & Admin and select IAM\\n2. Click Grant Access if your newly created dbt service account isn't listed\\n3. In New principals field, add your service account\\n4. Select a Role and search for BigQuery Job User to add\\n5. Go back to dbt cloud project setup and Test your connection\\n6. Note: Also add BigQuery Data Owner, Storage Object Admin, & Storage Admin to prevent permission issues later in the course\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Setup - Connecting dbt Cloud with BigQuery Error'},\n",
       "  {'text': 'error: This dbt Cloud run was cancelled because a valid dbt project was not found. Please check that the repository contains a proper dbt_project.yml config file. If your dbt project is located in a subdirectory of the connected repository, be sure to specify its location on the Project settings page in dbt Cloud',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Dbt build error'},\n",
       "  {'text': \"Error: Failed to clone repository.\\ngit clone git@github.com:DataTalksClub/data-engineering-zoomcamp.git /usr/src/develop/…\\nCloning into '/usr/src/develop/...\\nWarning: Permanently added 'github.com,140.82.114.4' (ECDSA) to the list of known hosts.\\ngit@github.com: Permission denied (publickey).\\nfatal: Could not read from remote repository.\\nIssue: You don’t have permissions to write to DataTalksClub/data-engineering-zoomcamp.git\\nSolution 1: Clone the repository and use this forked repo, which contains your github username. Then, proceed to specify the path, as in:\\n[your github username]/data-engineering-zoomcamp.git\\nSolution 2: create a fresh repo for dbt-lessons. We’d need to do branching and PRs in this lesson, so it might be a good idea to also not mess up your whole other repo. Then you don’t have to create a subfolder for the dbt project files\\nSolution 3: Use https link\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Setup - Failed to clone repository.'},\n",
       "  {'text': \"Solution:\\nCheck if you’re on the Developer Plan. As per the prerequisites, you'll need to be enrolled in the Team Plan or Enterprise Plan to set up a CI Job in dbt Cloud.\\nSo If you're on the Developer Plan, you'll need to upgrade to utilise CI Jobs.\\nNote from another user: I’m in the Team Plan (trial period) but the option is still disabled. What worked for me instead was this. It works for the Developer (free) plan.\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'dbt job - Triggered by pull requests is disabled when I try to create a new Continuous Integration job in dbt cloud.'},\n",
       "  {'text': 'Issue: If the DBT cloud IDE loading indefinitely then giving you this error\\nSolution: check the dbt_cloud_setup.md  file and make a SSH Key and use gitclone to import repo into dbt project, copy and paste deploy key back in your repo setting.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Setup - Your IDE session was unable to start. Please contact support.'},\n",
       "  {'text': 'Issue: If you don’t define the column format while converting from csv to parquet Python will “choose” based on the first rows.\\n✅Solution: Defined the schema while running web_to_gcp.py pipeline.\\nSebastian adapted the script:\\nhttps://github.com/sebastian2296/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/web_to_gcs.py\\nNeed a quick change to make the file work with gz files, added the following lines (and don’t forget to delete the file at the end of each iteration of the loop to avoid any problem of disk space)\\nfile_name_gz = f\"{service}_tripdata_{year}-{month}.csv.gz\"\\nopen(file_name_gz, \\'wb\\').write(r.content)\\nos.system(f\"gzip -d {file_name_gz}\")\\nos.system(f\"rm {file_name_init}.*\")\\nSame ERROR - When running dbt run for fact_trips.sql, the task failed with error:\\n“Parquet column \\'ehail_fee\\' has type DOUBLE which does not match the target cpp_type INT64”\\n开启屏幕阅读器支持\\n要启用屏幕阅读器支持，请按Ctrl+Alt+Z。要了解键盘快捷键，请按Ctrl+斜杠。\\n查找和替换\\nReason: Parquet files have their own schema. Some parquet files for green data have records with decimals in ehail_fee column.\\nThere are some possible fixes:\\nDrop ehail_feel column since it is not really used. 
For instance when creating a partitioned table from the external table in BigQuery\\nSELECT * EXCEPT (ehail_fee) FROM…\\nModify stg_green_tripdata.sql model using this line cast(0 as numeric) as ehail_fee.\\nModify Airflow dag to make the conversion and avoid the error.\\npv.read_csv(src_file, convert_options=pv.ConvertOptions(column_types = {\\'ehail_fee\\': \\'float64\\'}))\\nSame type of ERROR - parquet files with different data types - Fix it with pandas\\nHere is another possibility that could be interesting:\\nYou can specify the dtypes when importing the file from csv to a dataframe with pandas\\npd.from_csv(..., dtype=type_dict)\\nOne obstacle is that the regular int64 pandas use (I think this is from the numpy library) does not accept null values (NaN, not a number). But you can use the pandas Int64 instead, notice capital ‘I’. The type_dict is a python dictionary mapping the column names to the dtypes.\\nSources:\\nhttps://pandas.pydata.org/docs/reference/api/pandas.read_csv.html\\nNullable integer data type — pandas 1.5.3 documentation',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'DBT - I am having problems with columns datatype while running DBT/BigQuery'},\n",
       "  {'text': 'If the provided URL isn’t working for you (https://nyc-tlc.s3.amazonaws.com/trip+data/):\\nWe can use the GitHub CLI to easily download the needed trip data from https://github.com/DataTalksClub/nyc-tlc-data, and manually upload to a GCS bucket.\\nInstructions on how to download the CLI here: https://github.com/cli/cli\\nCommands to use:\\ngh auth login\\ngh release list -R DataTalksClub/nyc-tlc-data\\ngh release download yellow -R DataTalksClub/nyc-tlc-data\\ngh release download green -R DataTalksClub/nyc-tlc-data\\netc.\\nNow you can upload the files to a GCS bucket using the GUI.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Ingestion: When attempting to use the provided quick script to load trip data into GCS, you receive error Access Denied from the S3 bucket'},\n",
       "  {'text': \"R: This conversion is needed for the question 3 of homework, in order to process files for fhv data. The error is:\\npyarrow.lib.ArrowInvalid: CSV parse error: Expected 7 columns, got 1: B02765\\nCause: Some random line breaks in this particular file.\\nFixed by opening a bash in the container executing the dag and manually running the following command that deletes all \\\\n not preceded by \\\\r.\\nperl -i -pe 's/(?<!\\\\r)\\\\n/\\\\1/g' fhv_tripdata_2020-01.csv\\nAfter that, clear the failed task in Airflow to force re-execution.\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Ingestion - Error thrown by format_to_parquet_task when converting fhv_tripdata_2020-01.csv using Airflow'},\n",
       "  {'text': 'I initially followed data-engineering-zoomcamp/03-data-warehouse/extras/web_to_gcs.py at main · DataTalksClub/data-engineering-bootcamp (github.com)\\nBut it was taking forever for the yellow trip data and when I tried to download and upload the parquet files directly to GCS, that works fine but when creating the Bigquery table, there was a schema inconsistency issue\\nThen I found another hack shared in the slack which was suggested by Victoria.\\n[Optional] Hack for loading data to BigQuery for Week 4 - YouTube\\nPlease watch until the end as there is few schema changes required to be done',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Hack to load yellow and green trip data for 2019 and 2020'},\n",
       "  {'text': '“gs\\\\storage_link\\\\*.parquet” need to be added in destination folder',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Move many files (more than one) from Google cloud storage bucket to Big query'},\n",
       "  {'text': 'One common cause experienced is lack of space after running prefect several times. When running prefect, check the folder ‘.prefect/storage’ and delete the logs now and then to avoid the problem.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'GCP VM - All of sudden ssh stopped working for my VM after my last restart'},\n",
       "  {'text': 'You can try to do this steps:',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'GCP VM - If you have lost SSH access to your machine due to lack of space. Permission denied (publickey)'},\n",
       "  {'text': 'R: Go to BigQuery, and check the location of BOTH\\nThe source dataset (trips_data_all), and\\nThe schema you’re trying to write to (name should be \\tdbt_<first initial><last name> (if you didn’t change the default settings at the end when setting up your project))\\nLikely, your source data will be in your region, but the write location will be a multi-regional location (US in this example). Delete these datasets, and recreate them with your specified region and the correct naming format.\\nAlternatively, instead of removing datasets, you can specify the single-region location you are using. E.g. instead of ‘location: US’, specify the region, so ‘location: US-east1’. See this Github comment for more detail. Additionally please see this post of Sandy\\nIn DBT cloud you can actually specify the location using the following steps:\\nGPo to your profile page (top right drop-down --> profile)\\nThen go to under Credentials --> Analytics (you may have customised this name)\\nClick on Bigquery >\\nHit Edit\\nUpdate your location, you may need to re-upload your service account JSON to re-fetch your private key, and save. (NOTE: be sure to exactly copy the region BigQuery specifies your dataset is in.)',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': '404 Not found: Dataset eighth-zenith-372015:trip_data_all was not found in location us-west1'},\n",
       "  {'text': 'Error: `dbt_utils.surrogate_key` has been replaced by `dbt_utils.generate_surrogate_key`\\nFix:\\nReplace dbt_utils.surrogate_key  with dbt_utils.generate_surrogate_key in stg_green_tripdata.sql\\nWhen executing dbt run after fact_trips.sql has been created, the task failed with error:\\nR: “Access Denied: BigQuery BigQuery: Permission denied while globbing file pattern.”\\n1. Fixed by adding the Storage Object Viewer role to the service account in use in BigQuery.\\n2. Add the related roles to the service account in use in GCS.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'When executing dbt run after installing dbt-utils latest version i.e., 1.0.0 warning has generated'},\n",
       "  {'text': 'You need to create packages.yml file in main project directory and add packages’ meta data:\\npackages:\\n- package: dbt-labs/dbt_utils\\nversion: 0.8.0\\nAfter creating file run:\\nAnd hit enter.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'When You are getting error dbt_utils not found'},\n",
       "  {'text': \"Ensure you properly format your yml file. Check the build logs if the run was completed successfully. You can expand the command history console (where you type the --vars '{'is_test_run': 'false'}')  and click on any stage’s logs to expand and read errors messages or warnings.\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Lineage is currently unavailable. Check that your project does not contain compilation errors or contact support if this error persists.'},\n",
       "  {'text': \"Make sure you use:\\ndbt run --var ‘is_test_run: false’ or\\ndbt build --var ‘is_test_run: false’\\n(watch out for formatted text from this document: re-type the single quotes). If that does not work, use --vars '{'is_test_run': 'false'}' with each phrase separately quoted.\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Build - Why do my Fact_trips only contain a few days of data?'},\n",
       "  {'text': 'Check if you specified if_exists argument correctly when writing data from GCS to BigQuery. When I wrote my automated flow for each month of the years 2019 and 2020 for green and yellow data I had specified if_exists=\"replace\" while I was experimenting with the flow setup. Once you want to run the flow for all months in 2019 and 2020 make sure to set if_exists=\"append\"\\nif_exists=\"replace\" will replace the whole table with only the month data that you are writing into BigQuery in that one iteration -> you end up with only one month in BigQuery (the last one you inserted)\\nif_exists=\"append\" will append the new monthly data -> you end up with data from all months',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Build - Why do my fact_trips only contain one month of data?'},\n",
       "  {'text': \"R: After the second SELECT, change this line:\\ndate_trunc('month', pickup_datetime) as revenue_month,\\nTo this line:\\ndate_trunc(pickup_datetime, month) as revenue_month,\\nMake sure that “month” isn’t surrounded by quotes!\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'BigQuery returns an error when I try to run the dm_monthly_zone_revenue.sql model.'},\n",
       "  {'text': 'For this instead:\\n{{ dbt_utils.generate_surrogate_key([ \\n     field_a, \\n     field_b, \\n     field_c,\\n     …,\\n     field_z\\n]) }}',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Replace: \\n{{ dbt_utils.surrogate_key([ \\n     field_a, \\n     field_b, \\n     field_c,\\n     …,\\n     field_z     \\n]) }}'},\n",
       "  {'text': 'Remove the dataset from BigQuery which was created by dbt and run dbt run again so that it will recreate the dataset in BigQuery with the correct location',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'I changed location in dbt, but dbt run still gives me an error'},\n",
       "  {'text': 'Remove the dataset from BigQuery created by dbt and run again (with test disabled) to ensure the dataset created has all the rows.\\nDBT - Why am I getting a new dataset after running my CI/CD Job? / What is this new dbt dataset in BigQuery?\\nAnswer: when you create the CI/CD job, under ‘Compare Changes against an environment (Deferral) make sure that you select ‘ No; do not defer to another environment’ - otherwise dbt won’t merge your dev models into production models; it will create a new environment called ‘dbt_cloud_pr_number of pull request’',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'I ran dbt run without specifying variable which gave me a table of 100 rows. I ran again with the variable value specified but my table still has 100 rows in BQ.'},\n",
       "  {'text': \"Vic created three different datasets in the videos.. dbt_<name> was used for development and you used a production dataset for the production environment. What was the use for the staging dataset?\\nR: Staging, as the name suggests, is like an intermediate between the raw datasets and the fact and dim tables, which are the finished product, so to speak. You'll notice that the datasets in staging are materialised as views and not tables.\\nVic didn't use it for the project, you just need to create production and dbt_name + trips_data_all that you had already.\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Why do we need the Staging dataset?'},\n",
       "  {'text': 'Try removing the “network: host” line in docker-compose.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'DBT Docs Served but Not Accessible via Browser'},\n",
       "  {'text': 'Go to Account settings >> Project >> Analytics >> Click on your connection >> go all the way down to Location and type in the GCP location just as displayed in GCP (e.g. europe-west6). You might need to reupload your GCP key.\\nDelete your dataset in GBQ\\nRebuild project: dbt build\\nNewly built dataset should be in the correct location',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'BigQuery adapter: 404 Not found: Dataset was not found in location europe-west6'},\n",
       "  {'text': 'Create a new branch to edit. More on this can be found here in the dbt docs.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Dbt+git - Main branch is “read-only”'},\n",
       "  {'text': 'Create a new branch for development, then you can merge it to the main branch\\nCreate a new branch and switch to this branch. It allows you to make changes. Then you can commit and push the changes to the “main” branch.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': \"Dbt+git - It appears that I can't edit the files because I'm in read-only mode. Does anyone know how I can change that?\"},\n",
       "  {'text': \"Error:\\nTriggered by pull requests\\nThis feature is only available for dbt repositories connected through dbt Cloud's native integration with Github, Gitlab, or Azure DevOps\\nSolution: Contrary to the guide on DTC repo, don’t use the Git Clone option. Use the Github one instead. Step-by-step guide to UN-LINK Git Clone and RE-LINK with Github in the next entry below\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Dbt deploy + Git CI - cannot create CI checks job for deployment to Production. See more discussion in slack chat'},\n",
       "  {'text': 'If you’re trying to configure CI with Github and on the job’s options you can’t see Run on Pull Requests? on triggers, you have to reconnect with Github using native connection instead clone by SSH. Follow these steps:\\nOn Profile Settings > Linked Accounts connect your Github account with dbt project allowing the permissions asked. More info at https://docs.getdbt.com/docs/collaborate/git/connect-gith\\nDisconnect your current Github’s configuration from Account Settings > Projects (analytics) > Github connection. At the bottom left appears the button Disconnect, press it.\\nOnce we have confirmed the change, we can configure it again. This time, choose Github and it will appear in all repositories which you have allowed to work with dbt. Select your repository and it’s ready.\\nGo to the Deploy > job configuration’s page and go down until Triggers and now you can see the option Run on Pull Requests:',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Dbt deploy + Git CI - Unable to configure Continuous Integration (CI) with Github'},\n",
       "  {'text': \"If you're following video DE Zoomcamp 4.3.1 - Building the First DBT Models, you may have encountered an issue at 14:25 where the Lineage graph isn't displayed and a Compilation Error occurs, as shown in the attached image. Don't worry - a quick fix for this is to simply save your schema.yml file. Once you've done this, you should be able to view your Lineage graph without any further issues.\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': \"Compilation Error (Model 'model.my_new_project.stg_green_tripdata' (models/staging/stg_green_tripdata.sql) depends on a source named 'staging.green_trip_external' which was not found)\"},\n",
       "  {'text': '> in macro test_accepted_values (tests/generic/builtin.sql)\\n> called by test accepted_values_stg_green_tripdata_Payment_type__False___var_payment_type_values_ (models/staging/schema.yml)\\nRemember that you have to add to dbt_project.yml the vars:\\nvars:\\npayment_type_values: [1, 2, 3, 4, 5, 6]',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': \"'NoneType' object is not iterable\"},\n",
       "  {'text': \"You will face this issue if you copied and pasted the exact macro directly from data-engineering-zoomcamp repo.\\nBigQuery adapter: Retry attempt 1 of 1 after error: BadRequest('No matching signature for operator CASE for argument types: STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, NULL at [35:5]; reason: invalidQuery, location: query, message: No matching signature for operator CASE for argument types: STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, NULL at [35:5]')\\nWhat you’d have to do is to change the data type of the numbers (1, 2, 3 etc.) to text by inserting ‘’, as the initial ‘payment_type’ data type should be string (Note: I extracted and loaded the green trips data using Google BQ Marketplace)\\n{#\\nThis macro returns the description of the payment_type\\n#}\\n{% macro get_payment_type_description(payment_type) -%}\\ncase {{ payment_type }}\\nwhen '1' then 'Credit card'\\nwhen '2' then 'Cash'\\nwhen '3' then 'No charge'\\nwhen '4' then 'Dispute'\\nwhen '5' then 'Unknown'\\nwhen '6' then 'Voided trip'\\nend\\n{%- endmacro %}\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'dbt macro errors with get_payment_type_description(payment_type)'},\n",
       "  {'text': 'The dbt error  log contains a link to BigQuery. When you follow it you will see your query and the problematic line will be highlighted.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Troubleshooting in dbt:'},\n",
       "  {'text': 'It is a default behaviour of dbt to append custom schema to initial schema. To override this behaviour simply create a macro named “generate_schema_name.sql”:\\n{% macro generate_schema_name(custom_schema_name, node) -%}\\n{%- set default_schema = target.schema -%}\\n{%- if custom_schema_name is none -%}\\n{{ default_schema }}\\n{%- else -%}\\n{{ custom_schema_name | trim }}\\n{%- endif -%}\\n{%- endmacro %}\\nNow you can override default custom schema in “dbt_project.yml”:',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Why changing the target schema to “marts” actually creates a schema named “dbt_marts” instead?'},\n",
       "  {'text': 'There is a project setting which allows you to set `Project subdirectory` in dbt cloud:',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'How to set subdirectory of the github repository as the dbt project root'},\n",
       "  {'text': \"Remember that you should modify accordingly your .sql models, to read from existing table names in BigQuery/postgres db\\nExample: select * from {{ source('staging',<your table name in the database>') }}\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': \"Compilation Error : Model 'model.XXX' (models/<model_path>/XXX.sql) depends on a source named '<a table name>' which was not found\"},\n",
       "  {'text': 'Make sure that you create a pull request from your Development branch to the Production branch (main by default). After that, check in your ‘seeds’ folder if the seed file is inside it.\\nAnother thing to check is your .gitignore file. Make sure that the .csv extension is not included.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': \"Compilation Error : Model '<model_name>' (<model_path>) depends on a node named '<seed_name>' which was not found   (Production Environment)\"},\n",
       "  {'text': '1. Go to your dbt cloud service account\\n1. Adding the  [Storage Object Admin,Storage Admin] role in addition tco BigQuery Admin.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'When executing dbt run after using fhv_tripdata as an external table: you get “Access Denied: BigQuery BigQuery: Permission denied”'},\n",
       "  {'text': 'Problem: when injecting data to bigquery, you may face the type error. This is because pandas by default will parse integer columns with missing value as float type.\\nSolution:\\nOne way to solve this problem is to specify/ cast data type Int64 during the data transformation stage.\\nHowever, you may be lazy to type all the int columns. If that is the case, you can simply use convert_dtypes to infer the data type\\n# Make pandas to infer correct data type (as pandas parse int with missing as float)\\ndf.fillna(-999999, inplace=True)\\ndf = df.convert_dtypes()\\ndf = df.replace(-999999, None)',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'How to automatically infer the column data type (pandas missing value issues)?'},\n",
       "  {'text': 'Seed files loaded from directory with name ‘seed’, that’s why you should rename dir with name ‘data’ to ‘seed’',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'When loading github repo raise exception that ‘taxi_zone_lookup’ not found'},\n",
       "  {'text': 'Check the .gitignore file and make sure you don’t have *.csv in it\\n\\nDbt error 404 was not found in location\\nMy specific error:\\nRuntime Error in rpc request (from remote system.sql) 404 Not found: Table dtc-de-0315:trips_data_all.green_tripdata_partitioned was not found in location europe-west6 Location: europe-west6 Job ID: 168ee9bd-07cd-4ca4-9ee0-4f6b0f33897c\\nMake sure all of your datasets have the correct region and not a generalised region:\\nEurope-west6 as opposed to EU\\n\\nMatch this in dbt settings:\\ndbt -> projects -> optional settings -> manually set location to match',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': '‘taxi_zone_lookup’ not found'},\n",
       "  {'text': \"The easiest way to avoid these errors is by ingesting the relevant data in a .csv.gz file type. Then, do:\\nCREATE OR REPLACE EXTERNAL TABLE `dtc-de.trips_data_all.fhv_tripdata`\\nOPTIONS (\\nformat = 'CSV',\\nuris = ['gs://dtc_data_lake_dtc-de-updated/data/fhv_all/fhv_tripdata_2019-*.csv.gz']\\n);\\nAs an example. You should no longer have any data type issues for week 4.\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Data type errors when ingesting with parquet files'},\n",
       "  {'text': 'This is due to the way the deduplication is done in the two staging files.\\nSolution: add order by in the partition by part of both staging files. Keep adding columns to order by until the number of rows in the fact_trips table is consistent when re-running the fact_trips model.\\nExplanation (a bit convoluted, feel free to clarify, correct etc.)\\nWe partition by vendor id and pickup_datetime and choose the first row (rn=1) from all these partitions. These partitions are not ordered, so every time we run this, the first row might be a different one. Since the first row is different between runs, it might or might not contain an unknown borough. Then, in the fact_trips model we will discard a different number of rows when we discard all values with an unknown borough.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Inconsistent number of rows when re-running fact_trips model'},\n",
       "  {'text': 'If you encounter data type error on trip_type column, it may due to some nan values that isn’t null in bigquery.\\nSolution: try casting it to FLOAT datatype instead of NUMERIC',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Data Type Error when running fact table'},\n",
       "  {'text': \"This error could result if you are using some select * query without mentioning the name of table for ex:\\nwith dim_zones as (\\nselect * from `engaged-cosine-374921`.`dbt_victoria_mola`.`dim_zones`\\nwhere borough != 'Unknown'\\n),\\nfhv as (\\nselect * from `engaged-cosine-374921`.`dbt_victoria_mola`.`stg_fhv_tripdata`\\n)\\nselect * from fhv\\ninner join dim_zones as pickup_zone\\non fhv.PUlocationID = pickup_zone.locationid\\ninner join dim_zones as dropoff_zone\\non fhv.DOlocationID = dropoff_zone.locationid\\n);\\nTo resolve just replace use : select fhv.* from fhv\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'CREATE TABLE has columns with duplicate name locationid.'},\n",
       "  {'text': 'Some ehail fees are null and casting them to integer gives Bad int64 value: 0.0 error,\\nSolution:\\nUsing safe_cast returns NULL instead of throwing an error. So use safe_cast from dbt_utils function in the jinja code for casting into integer as follows:\\n{{ dbt_utils.safe_cast(\\'ehail_fee\\',  api.Column.translate_type(\"integer\"))}} as ehail_fee,\\nCan also just use safe_cast(ehail_fee as integer) without relying on dbt_utils.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Bad int64 value: 0.0 error'},\n",
       "  {'text': \"You might encounter this when building the fact_trips.sql model. The issue may be with the payment_type_description field.\\nUsing safe_cast as above, would cause the entire field to become null. A better approach is to drop the offending decimal place, then cast to integer.\\ncast(replace({{ payment_type }},'.0','') as integer)\\nBad int64 value: 1.0 error (again)\\n\\nI found that there are more columns causing the bad INT64: ratecodeid and trip_type on Green_tripdata table.\\nYou can use the queries below to address them:\\nCAST(\\nREGEXP_REPLACE(CAST(rate_code AS STRING), r'\\\\.0', '') AS INT64\\n) AS ratecodeid,\\nCAST(\\nCASE\\nWHEN REGEXP_CONTAINS(CAST(trip_type AS STRING), r'\\\\.\\\\d+') THEN NULL\\nELSE CAST(trip_type AS INT64)\\nEND AS INT64\\n) AS trip_type,\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Bad int64 value: 2.0/1.0 error'},\n",
       "  {'text': 'The two solution above don’t work for me - I used the line below in `stg_green_trips.sql` to replace the original ehail_fee line:\\n`{{ dbt.safe_cast(\\'ehail_fee\\',  api.Column.translate_type(\"numeric\"))}} as ehail_fee,`',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'DBT - Error on building fact_trips.sql: Parquet column \\'ehail_fee\\' has type DOUBLE which does not match the target cpp_type INT64. File: gs://<gcs bucket>/<table>/green_taxi_2019-01.parquet\")'},\n",
       "  {'text': \"Remember to add a space between the variable and the value. Otherwise, it won't be interpreted as a dictionary.\\nIt should be:\\ndbt run --var 'is_test_run: false'\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'The - vars argument must be a YAML dictionary, but was of type str'},\n",
       "  {'text': \"You don't need to change the environment type. If you are following the videos, you are creating a Production Deployment, so the only available option is the correct one.'\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Not able to change Environment Type as it is greyed out and inaccessible'},\n",
       "  {'text': 'Database Error in model stg_yellow_tripdata (models/staging/stg_yellow_tripdata.sql)\\nAccess Denied: Table taxi-rides-ny-339813-412521:trips_data_all.yellow_tripdata: User does not have permission to query table taxi-rides-ny-339813-412521:trips_data_all.yellow_tripdata, or perhaps it does not exist in location US.\\ncompiled Code at target/run/taxi_rides_ny/models/staging/stg_yellow_tripdata.sql\\nIn my case, I was set up in a different branch, so always check the branch you are working on. Change the 04-analytics-engineering/taxi_rides_ny/models/staging/schema.yml file in the\\nsources:\\n- name: staging\\ndatabase: your_database_name\\nIf this error will continue when running dbt job, As for changing the branch for your job, you can use the ‘Custom Branch’ settings in your dbt Cloud environment. This allows you to run your job on a different branch than the default one (usually main). To do this, you need to:\\nGo to an environment and select Settings to edit it\\nSelect Only run on a custom branch in General settings\\nEnter the name of your custom branch (e.g. HW)\\nClick Save\\nCould not parse the dbt project. please check that the repository contains a valid dbt project\\nRunning the Environment on the master branch causes this error, you must activate “Only run on a custom branch” checkbox and specify the branch you are  working when Environment is setup.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Access Denied: Table taxi-rides-ny-339813-412521:trips_data_all.yellow_tripdata: User does not have permission to query table taxi-rides-ny-339813-412521:trips_data_all.yellow_tripdata, or perhaps it does not exist in location US.'},\n",
       "  {'text': 'Change to main branch, make a pull request from the development branch.\\nNote: this will take you to github.\\nApprove the merging and rerun you job, it would work as planned now',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Made change to your modelling files and commit the your development branch, but Job still runs on old file?'},\n",
       "  {'text': 'Before you can develop some data model on dbt, you should create development environment and set some parameter on it. After the model being developed, we should also create deployment environment to create and run some jobs.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Setup - I’ve set Github and Bigquery to dbt successfully. Why nothing showed in my Develop tab?'},\n",
       "  {'text': 'Error Message:\\nInvestigate Sentry error: ProtocolError \"Invalid input ConnectionInputs.SEND_HEADERS in state ConnectionState.CLOSED\"\\nSolution:\\nreference\\nRun it again because it happens sometimes. Or wait a few minutes, it will continue.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Prefect Agent retrieving runs from queue sometimes fails with httpx.LocalProtocolError'},\n",
       "  {'text': \"My taxi data was loaded into gcs with etl_web_to_gcs.py script that converts csv data into parquet. Then I placed raw data trips into external tables and when I executed dbt run I got an error message: Parquet column 'passenger_count' has type INT64 which does not match the target cpp_type DOUBLE. It is because several columns in files have different formats of data.\\nWhen I added df[col] = df[col].astype('Int64') transformation to the columns: passenger_count, payment_type, RatecodeID, VendorID, trip_type it went ok. Several people also faced this error and more about it you can read on the slack channel.\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'BigQuery returns an error when i try to run ‘dbt run’:'},\n",
       "  {'text': 'Use the syntax below instead if the code in the tutorial is not working.\\ndbt run --select stg_green_tripdata --vars \\'{\"is_test_run\": false}\\'',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': \"Running dbt run --models stg_green_tripdata --var 'is_test_run: false' is not returning anything:\"},\n",
       "  {'text': \"Following dbt with BigQuery on Docker readme.md, after `docker-compose build` and `docker-compose run dbt-bq-dtc init`, encountered error `ModuleNotFoundError: No module named 'pytz'`\\nSolution:\\nAdd `RUN python -m pip install --no-cache pytz` in the Dockerfile under `FROM --platform=$build_for python:3.9.9-slim-bullseye as base`\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': \"DBT - Error: No module named 'pytz' while setting up dbt with docker\"},\n",
       "  {'text': \"If you have problems editing dbt_project.yml when using Docker after ‘docker-compose run dbt-bq-dtc init’, to change profile ‘taxi_rides_ny’ to 'bq-dbt-workshop’, just run:\\nsudo chown -R username path\\nDBT - Internal Error: Profile should not be None if loading is completed\\nWhen  running dbt debug, change the directory to the newly created subdirectory (e.g: the newly created `taxi_rides_ny` directory, which contains the dbt project).\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': '\\u200b\\u200bVS Code: NoPermissions (FileSystemError): Error: EACCES: permission denied (linux)'},\n",
       "  {'text': 'When running a query on BigQuery sometimes could appear a this table is not on the specified location error.\\nFor this problem there is not a straightforward solution, you need to dig a little, but the problem could be one of these:\\nCheck the locations of your bucket, datasets and tables. Make sure they are all on the same one.\\nChange the query settings to the location you are in: on the query window select more -> query settings -> select the location\\nCheck if all the paths you are using in your query to your tables are correct: you can click on the table -> details -> and copy the path.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Google Cloud BigQuery Location Problems'},\n",
       "  {'text': 'This happens because we have moved the dbt project to another directory on our repo.\\nOr might be that you’re on a different branch than is expected to be merged from / to.\\nSolution:\\nGo to the projects window on dbt cloud -> settings -> edit -> and add directory (the extra path to the dbt project)\\nFor example:\\n/week5/taxi_rides_ny\\nMake sure your file explorer path and this Project settings path matches and there’s no files waiting to be committed to github if you’re running the job to deploy to PROD.\\nAnd that you had setup the PROD environment to check in the main branch, or whichever you specified.\\nIn the picture below, I had set it to ella2024 to be checked as “production-ready” by the “freshness” check mark at the PROD environment settings. So each time I merge a branch from something else into ella2024 and then trigger the PR, the CI check job would kick-in. But we still do need to Merge and close the PR manually, I believe, that part is not automated.\\nYou set up the PROD custom branch (if not default main) in the Environment setup screen.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'DBT Deploy - This dbt Cloud run was cancelled because a valid dbt project was not found.'},\n",
       "  {'text': 'When you are creating the pull request and running the CI, dbt is creating a new schema on BIgQuery. By default that new schema will be created on ‘US’ location, if you have your dataset, schemas and tables on ‘EU’ that will generate an error and the pull request will not be accepted. To change that location to ‘EU’ on the connection to BigQuery from dbt we need to add the location ‘EU’ on the connection optional settings:\\nDbt -> project -> settings -> connection BIgQuery -> OPtional Settings -> Location -> EU',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'DBT Deploy + CI - Location Problems on BigQuery'},\n",
       "  {'text': 'When running trying to run the dbt project on prod there is some things you need to do and check on your own:\\nFirst Make the pull request and Merge the branch into the main.\\nMake sure you have the latest version, if you made changes to the repo in another place.\\nCheck if the dbt_project.yml file is accessible to the project, if not check this solution (Dbt: This dbt Cloud run was cancelled because a valid dbt project was not found.).\\nCheck if the name you gave to the dataset on BigQuery is the same you put on the dataset spot on the production environment created on dbt cloud.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'DBT Deploy - Error When trying to run the dbt project on Prod'},\n",
       "  {'text': 'In the step in this video (DE Zoomcamp 4.3.1 - Build the First dbt Models), after creating `stg_green_tripdata.sql` and clicking `build`, I encountered an error saying dataset not found in location EU. The default location for dbt Bigquery is the US, so when generating the new Bigquery schema for dbt, unless specified, the schema locates in the US.\\nSolution:\\nTurns out I forgot to specify Location to be `EU` when adding connection details.\\nDevelop -> Configure Cloud CLI -> Projects -> taxi_rides_ny -> (connection) Bigquery -> Edit -> Location (Optional) -> type `EU` -> Save',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'DBT - Error: “404 Not found: Dataset <dataset_name>:<dbt_schema_name> was not found in location EU” after building from stg_green_tripdata.sql'},\n",
       "  {'text': 'Issue: If you’re having problems loading the FHV_20?? data from the github repo into GCS and then into BQ (input file not of type parquet), you need to do two things. First, append the URL Template link with ‘?raw=true’ like so:\\nURL_TEMPLATE = URL_PREFIX + \"/fhv_tripdata_{{ execution_date.strftime(\\\\\\'%Y-%m\\\\\\') }}.parquet?raw=true\"\\nSecond, update make sure the URL_PREFIX is set to the following value:\\n\\nURL_PREFIX = \"https://github.com/alexeygrigorev/datasets/blob/master/nyc-tlc/fhv\"\\nIt is critical that you use this link with the keyword blob. If your link has ‘tree’ here, replace it. Everything else can stay the same, including the curl -sSLf command. ‘',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Homework - Ingesting FHV_20?? data'},\n",
       "  {'text': 'I found out that the easies way to upload datasets form github for the homework is utilising this script git_csv_to_gcs.py. Thank you Lidia!!\\nIt is similar to a script that Alexey provided us in 03-data-warehouse/extras/web_to_gcs.py',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Homework - Ingesting NYC TLC Data'},\n",
       "  {'text': 'If you have to securely put your credentials for a project and, probably, push it to a git repository then the best option is to use an environment variable\\nFor example for web_to_gcs.py or git_csv_to_gcs.py we have to set these variables:\\nGOOGLE_APPLICATION_CREDENTIALS\\nGCP_GCS_BUCKET\\nThe easises option to do it  is to use .env  (dotenv).\\nInstall it and add a few lines of code that inject these variables for your project\\npip install python-dotenv\\nfrom dotenv import load_dotenv\\nimport os\\n# Load environment variables from .env file\\nload_dotenv()\\n# Now you can access environment variables like GCP_GCS_BUCKET and GOOGLE_APPLICATION_CREDENTIALS\\ncredentials_path = os.getenv(\"GOOGLE_APPLICATION_CREDENTIALS\")\\nBUCKET = os.environ.get(\"GCP_GCS_BUCKET\")',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'How to set environment variable easily for any credentials'},\n",
       "  {'text': \"If you uploaded manually the fvh 2019 csv files, you may face errors regarding date types. Try to create an the external table in bigquery but define the pickup_datetime and dropoff_datetime to be strings\\nCREATE OR REPLACE EXTERNAL TABLE `gcp_project.trips_data_all.fhv_tripdata`  (\\ndispatching_base_num STRING,\\npickup_datetime STRING,\\ndropoff_datetime STRING,\\nPUlocationID STRING,\\nDOlocationID STRING,\\nSR_Flag STRING,\\nAffiliated_base_number STRING\\n)\\nOPTIONS (\\nformat = 'csv',\\nuris = ['gs://bucket/*.csv']\\n);\\nThen when creating the fhv core model in dbt, use TIMESTAMP(CAST(()) to ensure it first parses as a string and then convert it to timestamp.\\nwith fhv_tripdata as (\\nselect * from {{ ref('stg_fhv_tripdata') }}\\n),\\ndim_zones as (\\nselect * from {{ ref('dim_zones') }}\\nwhere borough != 'Unknown'\\n)\\nselect fhv_tripdata.dispatching_base_num,\\nTIMESTAMP(CAST(fhv_tripdata.pickup_datetime AS STRING)) AS pickup_datetime,\\nTIMESTAMP(CAST(fhv_tripdata.dropoff_datetime AS STRING)) AS dropoff_datetime,\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': \"Invalid date types after Ingesting FHV data through CSV files: Could not parse 'pickup_datetime' as a timestamp\"},\n",
       "  {'text': \"If you uploaded manually the fvh 2019 parquet files manually after downloading from https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2019-*.parquet you may face errors regarding date types while loading the data in a landing table (say fhv_tripdata). Try to create an the external table with the schema defines as following and load each month in a loop.\\n-----Correct load with schema defination----will not throw error----------------------\\nCREATE OR REPLACE EXTERNAL TABLE `dw-bigquery-week-3.trips_data_all.external_tlc_fhv_trips_2019` (\\ndispatching_base_num STRING,\\npickup_datetime TIMESTAMP,\\ndropoff_datetime TIMESTAMP,\\nPUlocationID FLOAT64,\\nDOlocationID FLOAT64,\\nSR_Flag FLOAT64,\\nAffiliated_base_number STRING\\n)\\nOPTIONS (\\nformat = 'PARQUET',\\nuris = ['gs://project id/fhv_2019_8.parquet']\\n);\\nCan Also USE  uris = ['gs://project id/fhv_2019_*.parquet'] (THIS WILL remove the need for the loop and can be done for all month in single RUN )\\n– THANKYOU FOR THIS –\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Invalid data types after Ingesting FHV data through parquet files: Could not parse SR_Flag as Float64,Couldn’t parse datetime column as timestamp,couldn’t handle NULL values in PULocationID,DOLocationID'},\n",
       "  {'text': 'When accessing Looker Studio through the Google Cloud Project console, you may be prompted to subscribe to the Pro version and receive the following errors:\\nInstead, navigate to https://lookerstudio.google.com/navigation/reporting which will take you to the free version.',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'Google Looker Studio - you have used up your 30-day trial'},\n",
       "  {'text': 'Ans: Dbt provides a mechanism called \"ref\" to manage dependencies between models. By referencing other models using the \"ref\" keyword in SQL, dbt automatically understands the dependencies and ensures the correct execution order.\\nLoading FHV Data goes into slumber using Mage?\\nTry loading the data using jupyter notebooks in a local environment. There might be bandwidth issues with Mage.\\nLoad the data into a pandas dataframe using the urls, make necessary transformations, upload the gcp bucket / alternatively download the parquet/csv files locally and then upload to GCP manually.\\nRegion Mismatch in DBT and BigQuery\\nIf you are using the datasets copied into BigQuery from BigQuery public datasets, the region will be set as US by default and hence it is much easier to set your dbt profile location as US while transforming the tables and views. \\nYou can change the location as follows:',\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'How does dbt handle dependencies between models?'},\n",
       "  {'text': \"Use the PostgreSQL COPY FROM feature that is compatible with csv files\\nCOPY table_name [ ( column_name [, ...] ) ]\\nFROM { 'filename' | PROGRAM 'command' | STDIN }\\n[ [ WITH ] ( option [, ...] ) ]\\n[ WHERE condition ]\",\n",
       "   'section': 'Module 4: analytics engineering with dbt',\n",
       "   'question': 'What is the fastest way to upload taxi data to dbt-postgres?'},\n",
       "  {'text': 'Update the line:\\nWith:',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'When configuring the profiles.yml file for dbt-postgres with jinja templates with environment variables, I\\'m getting \"Credentials in profile \"PROFILE_NAME\", target: \\'dev\\', invalid: \\'5432\\'is not of type \\'integer\\''},\n",
       "  {'text': 'Install SDKMAN:\\ncurl -s \"https://get.sdkman.io\" | bash\\nsource \"$HOME/.sdkman/bin/sdkman-init.sh\"\\nUsing SDKMAN, install Java 11 and Spark 3.3.2:\\nsdk install java 11.0.22-tem\\nsdk install spark 3.3.2\\nOpen a new terminal or run the following in the same shell:\\nsource \"$HOME/.sdkman/bin/sdkman-init.sh\"\\nVerify the locations and versions of Java and Spark that were installed:\\necho $JAVA_HOME\\njava -version\\necho $SPARK_HOME\\nspark-submit --version',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Setting up Java and Spark (with PySpark) on Linux (Alternative option using SDKMAN)'},\n",
       "  {'text': 'If you’re seriously struggling to set things up \"locally\" (here locally meaning non/partly-managed environment like own laptop, a VM or Codespaces) you can use the following guide to use Spark in Google Colab:\\nhttps://medium.com/gitconnected/launch-spark-on-google-colab-and-connect-to-sparkui-342cad19b304\\nStarter notebook:\\nhttps://github.com/aaalexlit/medium_articles/blob/main/Spark_in_Colab.ipynb\\nIt’s advisable to spend some time setting things up locally rather than jumping right into this solution.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'PySpark - Setting Spark up in Google Colab'},\n",
       "  {'text': 'If after installing Java (either jdk or openjdk), Hadoop and Spark, and setting the corresponding environment variables you find the following error when spark-shell is run at CMD:\\njava.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x3c947bc5) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed\\nmodule @0x3c947bc5\\nSolution: Java 17 or 19 is not supported by Spark. Spark 3.x: requires Java 8/11/16. Install Java 11 from the website provided in the windows.md setup file.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Spark-shell: unable to load native-hadoop library for platform - Windows'},\n",
       "  {'text': 'I found this error while executing the user defined function in Spark (crazy_stuff_udf). I am working on Windows and using conda. After following the setup instructions, I found that the PYSPARK_PYTHON environment variable was not set correctly, given that conda has different python paths for each environment.\\nSolution:\\npip install findspark on the command line inside proper environment\\nAdd to the top of the script\\nimport findspark\\nfindspark.init()',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'PySpark - Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.'},\n",
       "  {'text': 'This is because Python 3.11 has some inconsistencies with such an old version of Spark. The solution is a downgrade in the Python version. Python 3.9 using a conda environment takes care of it. Or install newer PySpark >= 3.5.1 works for me (Ella) [source].',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'PySpark - TypeError: code() argument 13 must be str, not int  , while executing `import pyspark`  (Windows/ Spark 3.0.3 - Python 3.11)'},\n",
       "  {'text': 'If anyone is a Pythonista or becoming one (which you will essentially be one along this journey), and desires to have all python dependencies under same virtual environment (e.g. conda) as done with prefect and previous exercises, simply follow these steps\\nInstall OpenJDK 11,\\non MacOS: $ brew install java11\\nAdd export PATH=\"/opt/homebrew/opt/openjdk@11/bin:$PATH\"\\nto ~/.bashrc or ~/zshrc\\nActivate working environment (by pipenv / poetry / conda)\\nRun $ pip install pyspark\\nWork with exercises as normal\\nAll default commands of spark will be also available at shell session under activated enviroment.\\nHope this can help!\\nP.s. you won’t need findspark to firstly initialize.\\nPy4J - Py4JJavaError: An error occurred while calling (...)  java.net.ConnectException: Connection refused: no further information;\\nIf you\\'re getting `Py4JavaError` with a generic root cause, such as the described above (Connection refused: no further information). You\\'re most likely using incompatible versions of the JDK or Python with Spark.\\nAs of the current latest Spark version (3.5.0), it supports JDK 8 / 11 / 17. All of which can be easily installed with SDKMan! on macOS or Linux environments\\n\\n$ sdk install java 17.0.10-librca\\n$ sdk install spark 3.5.0\\n$ sdk install hadoop 3.3.5\\nAs PySpark 3.5.0 supports Python 3.8+ make sure you\\'re setting up your virtualenv with either 3.8 / 3.9 / 3.10 / 3.11 (Most importantly avoid using 3.12 for now as not all libs in the data-science/engineering ecosystem are fully package for that)\\n\\n\\n$ conda create -n ENV_NAME python=3.11\\n$ conda activate ENV_NAME\\n$ pip install pyspark==3.5.0\\nThis setup makes installing `findspark` and the likes of it unnecessary. Happy coding.\\nPy4J - Py4JJavaError: An error occurred while calling o54.parquet. 
Or any kind of Py4JJavaError that show up after run df.write.parquet(\\'zones\\')(On window)\\nThis assume you already correctly set up the PATH in the nano ~/.bashrc\\nHere my\\nexport JAVA_HOME=\"/c/tools/jdk-11.0.21\"\\nexport PATH=\"${JAVA_HOME}/bin:${PATH}\"\\nexport HADOOP_HOME=\"/c/tools/hadoop-3.2.0\"\\nexport PATH=\"${HADOOP_HOME}/bin:${PATH}\"\\nexport SPARK_HOME=\"/c/tools/spark-3.3.2-bin-hadoop3\"\\nexport PATH=\"${SPARK_HOME}/bin:${PATH}\"\\nexport PYTHONPATH=\"${SPARK_HOME}/python/:$PYTHONPATH\"\\nexport PYTHONPATH=\"${SPARK_HOME}spark-3.5.1-bin-hadoop3py4j-0.10.9.5-src.zip:$PYTHONPATH\"\\nYou also need to add environment variables correctly which paths to java jdk, spark and hadoop through\\nGo to Stephenlaye2/winutils3.3.0: winutils.exe hadoop.dll and hdfs.dll binaries for hadoop windows (github.com), download the right winutils for hadoop-3.2.0. Then create a new folder,bin and put every thing in side to make a /c/tools/hadoop-3.2.0/bin(You might not need to do this, but after testing it without the /bin I could not make it to work)\\nThen follow the solution in this video: How To Resolve Issue with Writing DataFrame to Local File | winutils | msvcp100.dll (youtube.com)\\nRemember to restart IDE and computer, After the error An error occurred while calling o54.parquet.  is fixed but new errors like o31.parquet. Or o35.parquet. appear.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Java+Spark - Easy setup with miniconda env (worked on MacOS)'},\n",
       "  {'text': 'After installing all including pyspark (and it is successfully imported), but then running this script on the jupyter notebook\\nimport pyspark\\nfrom pyspark.sql import SparkSession\\nspark = SparkSession.builder \\\\\\n.master(\"local[*]\") \\\\\\n.appName(\\'test\\') \\\\\\n.getOrCreate()\\ndf = spark.read \\\\\\n.option(\"header\", \"true\") \\\\\\n.csv(\\'taxi+_zone_lookup.csv\\')\\ndf.show()\\nit gives the error:\\nRuntimeError: Java gateway process exited before sending its port number\\n✅The solution (for me) was:\\npip install findspark on the command line and then\\nAdd\\nimport findspark\\nfindspark.init()\\nto the top of the script.\\nAnother possible solution is:\\nCheck that pyspark is pointing to the correct location.\\nRun pyspark.__file__. It should be list /home/<your user name>/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/__init__.py if you followed the videos.\\nIf it is pointing to your python site-packages remove the pyspark directory there and check that you have added the correct exports to you .bashrc file and that there are not any other exports which might supersede the ones provided in the course content.\\nTo add to the solution above, if the errors persist in regards to setting the correct path for spark,  an alternative solution for permanent path setting solve the error is  to set environment variables on system and user environment variables following this tutorial: Install Apache PySpark on Windows PC | Apache Spark Installation Guide\\nOnce everything is installed, skip to 7:14 to set up environment variables. This allows for the environment variables to be set permanently.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'lsRuntimeError: Java gateway process exited before sending its port number'},\n",
       "  {'text': 'Even after installing pyspark correctly on linux machine (VM ) as per course instructions, faced a module not found error in jupyter notebook .\\nThe solution which worked for me(use following in jupyter notebook) :\\n!pip install findspark\\nimport findspark\\nfindspark.init()\\nThereafter , import pyspark and create spark contex<<t as usual\\nNone of the solutions above worked for me till I ran !pip3 install pyspark instead !pip install pyspark.\\nFilter based on conditions based on multiple columns\\nfrom pyspark.sql.functions import col\\nnew_final.filter((new_final.a_zone==\"Murray Hill\") & (new_final.b_zone==\"Midwood\")).show()\\nKrishna Anand',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Module Not Found Error in Jupyter Notebook .'},\n",
       "  {'text': 'You need to look for the Py4J file and note the version of the filename. Once you know the version, you can update the export command accordingly, this is how you check yours:\\n` ls ${SPARK_HOME}/python/lib/ ` and then you add it in the export command, mine was:\\nexport PYTHONPATH=”${SPARK_HOME}/python/lib/Py4J-0.10.9.5-src.zip:${PYTHONPATH}”\\nMake sure that the version under `${SPARK_HOME}/python/lib/` matches the filename of py4j or you will encounter `ModuleNotFoundError: No module named \\'py4j\\'` while executing `import pyspark`.\\nFor instance, if the file under `${SPARK_HOME}/python/lib/` was `py4j-0.10.9.3-src.zip`.\\nThen the export PYTHONPATH statement above should be changed to `export PYTHONPATH=\"${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH\"` appropriately.\\nAdditionally, you can check for the version of ‘py4j’ of the spark you’re using from here and update as mentioned above.\\n~ Abhijit Chakraborty: Sometimes, even with adding the correct version of py4j might not solve the problem. Simply run pip install py4j and problem should be resolved.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': \"Py4JJavaError - ModuleNotFoundError: No module named 'py4j'` while executing `import pyspark`\"},\n",
       "  {'text': 'If below does not work, then download the latest available py4j version with\\nconda install -c conda-forge py4j\\nTake care of the latest version number in the website to replace appropriately.\\nNow add\\nexport PYTHONPATH=\"${SPARK_HOME}/python/:$PYTHONPATH\"\\nexport PYTHONPATH=\"${SPARK_HOME}/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH\"\\nin your  .bashrc file.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': \"Py4J Error - ModuleNotFoundError: No module named 'py4j' (Solve with latest version)\"},\n",
       "  {'text': 'Even after we have exported our paths correctly you may find that  even though Jupyter is installed you might not have Jupyter Noteboopgak for one reason or another. Full instructions are found here (for my walkthrough) or here (where I got the original instructions from) but are included below. These instructions include setting up a virtual environment (handy if you are on your own machine doing this and not a VM):\\nFull steps:\\nUpdate and upgrade packages:\\nsudo apt update && sudo apt -y upgrade\\nInstall Python:\\nsudo apt install python3-pip python3-dev\\nInstall Python virtualenv:\\nsudo -H pip3 install --upgrade pip\\nsudo -H pip3 install virtualenv\\nCreate a Python Virtual Environment:\\nmkdir notebook\\ncd notebook\\nvirtualenv jupyterenv\\nsource jupyterenv/bin/activate\\nInstall Jupyter Notebook:\\npip install jupyter\\nRun Jupyter Notebook:\\njupyter notebook',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Exception: Jupyter command `jupyter-notebook` not found.'},\n",
       "  {'text': 'Code executed:\\ndf = spark.read.parquet(pq_path)\\n… some operations on df …\\ndf.write.parquet(pq_path, mode=\"overwrite\")\\njava.io.FileNotFoundException: File file:/home/xxx/code/data/pq/fhvhv/2021/02/part-00021-523f9ad5-14af-4332-9434-bdcb0831f2b7-c000.snappy.parquet does not exist\\nThe problem is that Sparks performs lazy transformations, so the actual action that trigger the job is df.write, which does delete the parquet files that is trying to read (mode=”overwrite”)\\n✅Solution: Write to a different directorydf\\ndf.write.parquet(pq_path_temp, mode=\"overwrite\")',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Error java.io.FileNotFoundException'},\n",
       "  {'text': 'You need to create the Hadoop /bin directory manually and add the downloaded files in there, since the shell script provided for Windows installation just puts them in /c/tools/hadoop-3.2.0/ .',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Hadoop - FileNotFoundException: Hadoop bin directory does not exist , when trying to write (Windows)'},\n",
       "  {'text': 'Actually Spark SQL is one independent “type” of SQL - Spark SQL.\\nThe several SQL providers are very similar:\\nSELECT [attributes]\\nFROM [table]\\nWHERE [filter]\\nGROUP BY [grouping attributes]\\nHAVING [filtering the groups]\\nORDER BY [attribute to order]\\n(INNER/FULL/LEFT/RIGHT) JOIN [table2]\\nON [attributes table joining table2] (...)\\nWhat differs the most between several SQL providers are built-in functions.\\nFor Built-in Spark SQL function check this link: https://spark.apache.org/docs/latest/api/sql/index.html\\nExtra information on SPARK SQL :\\nhttps://databricks.com/glossary/what-is-spark-sql#:~:text=Spark%20SQL%20is%20a%20Spark,on%20existing%20deployments%20and%20data.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Which type of SQL is used in Spark? Postgres? MySQL? SQL Server?'},\n",
       "  {'text': \"✅Solution: I had two notebooks running, and the one I wanted to look at had opened a port on localhost:4041.\\nIf a port is in use, then Spark uses the next available port number. It can be even 4044. Clean up after yourself when a port does not work or a container does not run.\\nYou can run spark.sparkContext.uiWebUrl\\nand result will be some like\\n'http://172.19.10.61:4041'\",\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'The spark viewer on localhost:4040 was not showing the current run'},\n",
       "  {'text': '✅Solution: replace Java Developer Kit 11 with Java Developer Kit 8.\\nJava - RuntimeError: Java gateway process exited before sending its port number\\nShows java_home is not set on the notebook log\\nhttps://sparkbyexamples.com/pyspark/pyspark-exception-java-gateway-process-exited-before-sending-the-driver-its-port-number/\\nhttps://twitter.com/drkrishnaanand/status/1765423415878463839',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Java - java.lang.NoSuchMethodError: sun.nio.ch.DirectBuffer.cleaner()Lsun/misc/Cleaner Error during repartition call (conda pyspark installation)'},\n",
       "  {'text': '✅I got it working using `gcs-connector-hadoop-2.2.5-shaded.jar` and Spark 3.1\\nI also added the google_credentials.json and .p12 to auth with gcs. These files are downloadable from GCP Service account.\\nTo create the SparkSession:\\nspark = SparkSession.builder.master(\\'local[*]\\') \\\\\\n.appName(\\'spark-read-from-bigquery\\') \\\\\\n.config(\\'BigQueryProjectId\\',\\'razor-project-xxxxxxx) \\\\\\n.config(\\'BigQueryDatasetLocation\\',\\'de_final_data\\') \\\\\\n.config(\\'parentProject\\',\\'razor-project-xxxxxxx) \\\\\\n.config(\"google.cloud.auth.service.account.enable\", \"true\") \\\\\\n.config(\"credentialsFile\", \"google_credentials.json\") \\\\\\n.config(\"GcpJsonKeyFile\", \"google_credentials.json\") \\\\\\n.config(\"spark.driver.memory\", \"4g\") \\\\\\n.config(\"spark.executor.memory\", \"2g\") \\\\\\n.config(\"spark.memory.offHeap.enabled\",True) \\\\\\n.config(\"spark.memory.offHeap.size\",\"5g\") \\\\\\n.config(\\'google.cloud.auth.service.account.json.keyfile\\', \"google_credentials.json\") \\\\\\n.config(\"fs.gs.project.id\", \"razor-project-xxxxxxx\") \\\\\\n.config(\"fs.gs.impl\", \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem\") \\\\\\n.config(\"fs.AbstractFileSystem.gs.impl\", \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS\") \\\\\\n.getOrCreate()',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Spark fails when reading from BigQuery and using `.show()` on `SELECT` queries'},\n",
       "  {'text': 'While creating a SparkSession using the config spark.jars.packages as com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2\\nspark = SparkSession.builder.master(\\'local\\').appName(\\'bq\\').config(\"spark.jars.packages\", \"com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2\").getOrCreate()\\nautomatically downloads the required dependency jars and configures the connector, removing the need to manage this dependency. More details available here',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Spark BigQuery connector Automatic configuration'},\n",
       "  {'text': 'Link to Slack Thread : has anyone figured out how to read from GCP data lake instead of downloading all the taxi data again?\\nThere’s a few extra steps to go into reading from GCS with PySpark\\n1.)  IMPORTANT: Download the Cloud Storage connector for Hadoop here: https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#clusters\\nAs the name implies, this .jar file is what essentially connects PySpark with your GCS\\n2.) Move the .jar file to your Spark file directory. I installed Spark using homebrew on my MacOS machine and I had to create a /jars directory under \"/opt/homebrew/Cellar/apache-spark/3.2.1/ (where my spark dir is located)\\n3.) In your Python script, there are a few extra classes you’ll have to import:\\nimport pyspark\\nfrom pyspark.sql import SparkSession\\nfrom pyspark.conf import SparkConf\\nfrom pyspark.context import SparkContext\\n4.) You must set up your configurations before building your SparkSession. Here’s my code snippet:\\nconf = SparkConf() \\\\\\n.setMaster(\\'local[*]\\') \\\\\\n.setAppName(\\'test\\') \\\\\\n.set(\"spark.jars\", \"/opt/homebrew/Cellar/apache-spark/3.2.1/jars/gcs-connector-hadoop3-latest.jar\") \\\\\\n.set(\"spark.hadoop.google.cloud.auth.service.account.enable\", \"true\") \\\\\\n.set(\"spark.hadoop.google.cloud.auth.service.account.json.keyfile\", \"path/to/google_credentials.json\")\\nsc = SparkContext(conf=conf)\\nsc._jsc.hadoopConfiguration().set(\"fs.AbstractFileSystem.gs.impl\",  \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS\")\\nsc._jsc.hadoopConfiguration().set(\"fs.gs.impl\", \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem\")\\nsc._jsc.hadoopConfiguration().set(\"fs.gs.auth.service.account.json.keyfile\", \"path/to/google_credentials.json\")\\nsc._jsc.hadoopConfiguration().set(\"fs.gs.auth.service.account.enable\", \"true\")\\n5.) 
Once you run that, build your SparkSession with the new parameters we’d just instantiated in the previous step:\\nspark = SparkSession.builder \\\\\\n.config(conf=sc.getConf()) \\\\\\n.getOrCreate()\\n6.) Finally, you’re able to read your files straight from GCS!\\ndf_green = spark.read.parquet(\"gs://{BUCKET}/green/202*/\")',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Spark Cloud Storage connector'},\n",
       "  {'text': 'from pyarrow.parquet import ParquetFile\\npf = ParquetFile(\\'fhvhv_tripdata_2021-01.parquet\\')\\n#pyarrow builds tables, not dataframes\\ntbl_small = next(pf.iter_batches(batch_size = 1000))\\n#this function converts the table to a dataframe of manageable size\\ndf = tbl_small.to_pandas()\\nAlternatively without PyArrow:\\ndf = spark.read.parquet(\\'fhvhv_tripdata_2021-01.parquet\\')\\ndf1 = df.sort(\\'DOLocationID\\').limit(1000)\\npdf = df1.select(\"*\").toPandas()\\ngcsu',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'How can I read a small number of rows from the parquet file directly?'},\n",
       "  {'text': 'Probably you’ll encounter this if you followed the video ‘5.3.1 - First Look at Spark/PySpark’ and used the parquet file from the TLC website (csv was used in the video).\\nWhen defining the schema, the PULocation and DOLocationID are defined as IntegerType. This will cause an error because the Parquet file is INT64 and you’ll get an error like:\\nParquet column cannot be converted in file [...] Column [...] Expected: int, Found: INT64\\nChange the schema definition from IntegerType to LongType and it should work',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'DataType error when creating Spark DataFrame with a specified schema?'},\n",
       "  {'text': 'df_finalx=df_finalw.select([col(x).alias(x.replace(\" \",\"\")) for x in df_finalw.columns])\\nKrishna Anand',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Remove white spaces from column names in Pyspark'},\n",
       "  {'text': 'This error comes up on the Spark video 5.3.1 - First Look at Spark/PySpark,\\nbecause as at the creation of the video, 2021 data was the most recent which utilised csv files but as at now its parquet.\\nSo when you run the command spark.createDataFrame(df1_pandas).show(),\\nYou get the Attribute error. This is caused by the pandas version 2.0.0 which seems incompatible with Spark 3.3.2, so to fix it you have to downgrade pandas to 1.5.3 using the command pip install -U pandas==1.5.3\\nAnother option is adding the following after importing pandas, if one does not want to downgrade pandas version (source) :\\npd.DataFrame.iteritems = pd.DataFrame.items\\nNote that this problem is solved with Spark versions from 3.4.1',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': \"AttributeError: 'DataFrame' object has no attribute 'iteritems'\"},\n",
       "  {'text': 'Another alternative is to install pandas 2.0.1 (it worked well as at the time of writing this), and it is compatible with Pyspark 3.5.1. Make sure to add or edit your environment variable like this:\\nexport SPARK_HOME=\"${HOME}/spark/spark-3.5.1-bin-hadoop3\"\\nexport PATH=\"${SPARK_HOME}/bin:${PATH}\"',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': \"AttributeError: 'DataFrame' object has no attribute 'iteritems'\"},\n",
       "  {'text': 'Open a CMD terminal in administrator mode\\ncd %SPARK_HOME%\\nStart a master node: bin\\\\spark-class org.apache.spark.deploy.master.Master\\nStart a worker node: bin\\\\spark-class org.apache.spark.deploy.worker.Worker spark://<master_ip>:<port> --host <IP_ADDR>\\nbin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 --host <IP_ADDR>\\nspark://<master_ip>:<port>: copy the address from the previous command, in my case it was spark://localhost:7077\\nUse --host <IP_ADDR> if you want to run the worker on a different machine. For now leave it empty.\\nNow you can access Spark UI through localhost:8080\\nHomework for Module 5:\\nDo not refer to the homework file located under /05-batch/code/. The correct file is located under\\nhttps://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2024/05-batch/homework.md',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Spark Standalone Mode on Windows'},\n",
       "  {'text': 'You can either type the export command every time you run a new session, add it to the .bashrc/ which you can find in /home or run this command at the beginning of your homebook:\\nimport findspark\\nfindspark.init()',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Export PYTHONPATH command in linux is temporary'},\n",
       "  {'text': 'I solved this issue: unzip the file with:\\nf\\nbefore creating head.csv',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Compressed file ended before the end-of-stream marker was reached'},\n",
       "  {'text': 'In the code along from Video 5.3.3 Alexey downloads the CSV files from the NYT website and gzips them in their bash script. If we now (2023) follow along but download the data from the GH course Repo, it will already be zippes as csv.gz files. Therefore we zip it again if we follow the code from the video exactly. This then leads to gibberish outcome when we then try to cat the contents or count the lines with zcat, because the file is zipped twitch and zcat only unzips it once.\\n✅solution: do not gzip the files downloaded from the course repo. Just wget them and save them as they are as csv.gz files. Then the zcat command and the showSchema command will also work\\nURL=\"${URL_PREFIX}/${TAXI_TYPE}/${TAXI_TYPE}_tripdata_${YEAR}-${FMONTH}.csv.gz\"\\nLOCAL_PREFIX=\"data/raw/${TAXI_TYPE}/${YEAR}/${FMONTH}\"\\nLOCAL_FILE=\"${TAXI_TYPE}_tripdata_${YEAR}_${FMONTH}.csv.gz\"\\nLOCAL_PATH=\"${LOCAL_PREFIX}/${LOCAL_FILE}\"\\necho \"downloading ${URL} to ${LOCAL_PATH}\"\\nmkdir -p ${LOCAL_PREFIX}\\nwget ${URL} -O ${LOCAL_PATH}\\necho \"compressing ${LOCAL_PATH}\"\\n# gzip ${LOCAL_PATH} <- uncomment this line',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Compression Error: zcat output is gibberish, seems like still compressed'},\n",
       "  {'text': 'Occurred while running : spark.createDataFrame(df_pandas).show()\\nThis error is usually due to the python version, since spark till date of 2 march 2023 doesn’t support python 3.11, try creating a new env with python version 3.8 and then run this command.\\nOn the virtual machine, you can create a conda environment (here called myenv) with python 3.10 installed:\\nconda create -n myenv python=3.10 anaconda\\nThen you must run conda activate myenv to run python 3.10. Otherwise you’ll still be running version 3.11. You can deactivate by typing conda deactivate.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'PicklingError: Could not serialise object: IndexError: tuple index out of range.'},\n",
       "  {'text': 'Make sure you have your credentials of your GCP in your VM under the location defined in the script.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Connecting from local Spark to GCS - Spark does not find my google credentials as shown in the video?'},\n",
       "  {'text': 'To run spark in docker setup\\n1. Build bitnami spark docker\\na. clone bitnami repo using command\\ngit clone https://github.com/bitnami/containers.git\\n(tested on commit 9cef8b892d29c04f8a271a644341c8222790c992)\\nb. edit file `bitnami/spark/3.3/debian-11/Dockerfile` and update java and spark version as following\\n\"python-3.10.10-2-linux-${OS_ARCH}-debian-11\" \\\\\\n\"java-17.0.5-8-3-linux-${OS_ARCH}-debian-11\" \\\\\\nreference: https://github.com/bitnami/containers/issues/13409\\nc. build docker image by navigating to above directory and running docker build command\\nnavigate cd bitnami/spark/3.3/debian-11/\\nbuild command docker build -t spark:3.3-java-17 .\\n2. run docker compose\\nusing following file\\n```yaml docker-compose.yml\\nversion: \\'2\\'\\nservices:\\nspark:\\nimage: spark:3.3-java-17\\nenvironment:\\n- SPARK_MODE=master\\n- SPARK_RPC_AUTHENTICATION_ENABLED=no\\n- SPARK_RPC_ENCRYPTION_ENABLED=no\\n- SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no\\n- SPARK_SSL_ENABLED=no\\nvolumes:\\n- \"./:/home/jovyan/work:rw\"\\nports:\\n- \\'8080:8080\\'\\n- \\'7077:7077\\'\\nspark-worker:\\nimage: spark:3.3-java-17\\nenvironment:\\n- SPARK_MODE=worker\\n- SPARK_MASTER_URL=spark://spark:7077\\n- SPARK_WORKER_MEMORY=1G\\n- SPARK_WORKER_CORES=1\\n- SPARK_RPC_AUTHENTICATION_ENABLED=no\\n- SPARK_RPC_ENCRYPTION_ENABLED=no\\n- SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no\\n- SPARK_SSL_ENABLED=no\\nvolumes:\\n- \"./:/home/jovyan/work:rw\"\\nports:\\n- \\'8081:8081\\'\\nspark-nb:\\nimage: jupyter/pyspark-notebook:java-17.0.5\\nenvironment:\\n- SPARK_MASTER_URL=spark://spark:7077\\nvolumes:\\n- \"./:/home/jovyan/work:rw\"\\nports:\\n- \\'8888:8888\\'\\n- \\'4040:4040\\'\\n```\\nrun command to deploy docker compose\\ndocker-compose up\\nAccess jupyter notebook using link logged in docker compose logs\\nSpark master url is spark://spark:7077',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Spark docker-compose setup'},\n",
       "  {'text': 'To do this\\npip install gcsfs,\\nThereafter copy the uri path to the file and use \\ndf = pandas.read_csc(gs://path)',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'How do you read data stored in gcs on pandas with your local computer?'},\n",
       "  {'text': 'Error:\\nspark.createDataFrame(df_pandas).schema\\nTypeError: field Affiliated_base_number: Can not merge type <class \\'pyspark.sql.types.StringType\\'> and <class \\'pyspark.sql.types.DoubleType\\'>\\nSolution:\\nAffiliated_base_number is a mix of letters and numbers (you can check this with a preview of the table), so it cannot be set to DoubleType (only for double-precision numbers). The suitable type would be StringType. Spark  inferSchema is more accurate than Pandas infer type method in this case. You can set it to  true  while reading the csv, so you don’t have to take out any data from your dataset. Something like this can help:\\ndf = spark.read \\\\\\n.options(\\nheader = \"true\", \\\\\\ninferSchema = \"true\", \\\\\\n) \\\\\\n.csv(\\'path/to/your/csv/file/\\')\\nSolution B:\\nIt\\'s because some rows in the affiliated_base_number are null and therefore it is assigned the datatype String and this cannot be converted to type Double. So if you really want to convert this pandas df to a pyspark df only take the  rows from the pandas df that are not null in the \\'Affiliated_base_number\\' column. Then you will be able to apply the pyspark function createDataFrame.\\n# Only take rows that have no null values\\npandas_df= pandas_df[pandas_df.notnull().all(1)]',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'TypeError when using spark.createDataFrame function on a pandas df'},\n",
       "  {'text': 'Default executor memory is 1gb. This error appeared when working with the homework dataset.\\nError: MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory\\nScaling row group sizes to 95.00% for 8 writers\\nSolution:\\nIncrease the memory of the executor when creating the Spark session like this:\\nRemember to restart the Jupyter session (ie. close the Spark session) or the config won’t take effect.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory'},\n",
       "  {'text': 'Change the working directory to the spark directory:\\nif you have setup up your SPARK_HOME variable, use the following;\\ncd %SPARK_HOME%\\nif not, use the following;\\ncd <path to spark installation>\\nCreating a Local Spark Cluster\\nTo start Spark Master:\\nbin\\\\spark-class org.apache.spark.deploy.master.Master --host localhost\\nStarting up a cluster:\\nbin\\\\spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 --host localhost',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'How to spark standalone cluster is run on windows OS'},\n",
       "  {'text': 'I added PYTHONPATH, JAVA_HOME and SPARK_HOME to ~/.bashrc, import pyspark worked ok in iPython in terminal, but couldn’t be found in .ipynb opened in VS Code\\nAfter adding new lines to ~/.bashrc, need to restart the shell to activate the new lines, do either\\nsource ~/.bashrc\\nexec bash\\nInstead of configuring paths in ~/.bashrc, I created .env file in the root of my workspace:',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Env variables set in ~/.bashrc are not loaded to Jupyter in VS Code'},\n",
       "  {'text': 'I don’t use visual studio, so I did it the old fashioned way: ssh -L 8888:localhost:8888 <my user>@<VM IP> (replace user and IP with the ones used by the GCP VM, e.g. : ssh -L 8888:localhost:8888 myuser@34.140.188.1',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'How to port forward outside VS Code'},\n",
       "  {'text': 'If you are doing wc -l fhvhv_tripdata_2021-01.csv.gz  with the gzip file as the file argument, you will get a different result, obviously! Since the file is compressed.\\nUnzip the file and then do wc -l fhvhv_tripdata_2021-01.csv to get the right results.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': '“wc -l” is giving a different result then shown in the video'},\n",
       "  {'text': 'when trying to:\\nURL=\"spark://$HOSTNAME:7077\"\\nspark-submit \\\\\\n--master=\"{$URL}\" \\\\\\n06_spark_sql.py \\\\\\n--input_green=data/pq/green/2021/*/ \\\\\\n--input_yellow=data/pq/yellow/2021/*/ \\\\\\n--output=data/report-2021\\nand you get errors like the following (SUMMARIZED):\\nWARN Utils: Your hostname, <HOSTNAME> resolves to a loopback address..\\nWARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting default log level to \"WARN\".\\nException in thread \"main\" org.apache.spark.SparkException: Master must either be yarn or start with spark, mesos, k8s, or local at …\\nTry replacing --master=\"{$URL}\"\\nwith --master=$URL (edited)\\nExtra edit for spark version 3.4.2 - if encountering:\\n`Error: Unrecognized option: --master=`\\n→ Replace `--master=\"{$URL}\"` with  `--master \"${URL}\"`',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': '`spark-submit` errors'},\n",
       "  {'text': 'If you are seeing this (or similar) error when attempting to write to parquet, it is likely an issue with your path variables.\\nFor Windows, create a new User Variable “HADOOP_HOME” that points to your Hadoop directory. Then add “%HADOOP_HOME%\\\\bin” to the PATH variable.\\nAdditional tips can be found here: https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Hadoop - Exception in thread \"main\" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z'},\n",
       "  {'text': \"Change the hadoop version to 3.0.1.Replace all the files in the local hadoop bin folder with the files in this repo:  winutils/hadoop-3.0.1/bin at master · cdarlint/winutils (github.com)\\nIf this does not work try to change other versions found in this repository.\\nFor more information please see this link: This version of %1 is not compatible with the version of Windows you're running · Issue #20 · cdarlint/winutils (github.com)\",\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Java.io.IOException. Cannot run program “C:\\\\hadoop\\\\bin\\\\winutils.exe”. CreateProcess error=216, This version of 1% is not compatible with the version of Windows you are using.'},\n",
       "  {'text': 'Fix is to set the flag like the error states. Get your project ID from your dashboard and set it like so:\\ngcloud dataproc jobs submit pyspark \\\\\\n--cluster=my_cluster \\\\\\n--region=us-central1 \\\\\\n--project=my-dtc-project-1010101 \\\\\\ngs://my-dtc-bucket-id/code/06_spark_sql.py\\n-- \\\\\\n…',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Dataproc - ERROR: (gcloud.dataproc.jobs.submit.pyspark) The required property [project] is not currently set. It can be set on a per-command basis by re-running your command with the [--project] flag.'},\n",
       "  {'text': 'Go to %SPARK_HOME%\\\\bin\\nRun spark-class org.apache.spark.deploy.master.Master to run the master. This will give you a URL of the form spark://ip:port\\nRun spark-class org.apache.spark.deploy.worker.Worker spark://ip:port to run the worker. Make sure you use the URL you obtained in step 2.\\nCreate a new Jupyter notebook:\\nspark = SparkSession.builder \\\\\\n.master(\"spark://{ip}:7077\") \\\\\\n.appName(\\'test\\') \\\\\\n.getOrCreate()\\nCheck on Spark UI the master, worker and app.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Run Local Cluster Spark in Windows 10 with CMD'},\n",
       "  {'text': 'This occurs because you are not logged in “gcloud auth login” and maybe the project id is not settled. Then type in a terminal:\\ngcloud auth login\\nThis will open a tab in the browser, accept the terms, after that close the tab if you want. Then set the project is like:\\ngcloud config set project <YOUR PROJECT_ID>\\nThen you can run the command to upload the pq dir to a GCS Bucket:\\ngsutil -m cp -r pq/ <YOUR URI from gsutil>/pq',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': \"lServiceException: 401 Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist).\"},\n",
       "  {'text': \"When submit a job, it might throw an error about Java in log panel within Dataproc. I changed the Versioning Control when I created a cluster, so it means that I delete the cluster and created a new one, and instead of choosing Debian-Hadoop-Spark, I switch to Ubuntu 20.02-Hadoop3.3-Spark3.3 for Versioning Control feature, the main reason to choose this is because I have the same Ubuntu version in mi laptop, I tried to find documentation to sustent this but unfortunately I couldn't nevertheless it works for me.\",\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'py4j.protocol.Py4JJavaError  GCP'},\n",
       "  {'text': \"Use both repartition and coalesce, like so:\\ndf = df.repartition(6)\\ndf = df.coalesce(6)\\ndf.write.parquet('fhv/2019/10', mode='overwrite')\",\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Repartition the Dataframe to 6 partitions using df.repartition(6) - got 8 partitions instead'},\n",
       "  {'text': \"Possible solution - Try to forward the port using ssh cli instead of vs code.\\nRun > “ssh -L <local port>:<VM host/ip>:<VM port> <ssh hostname>”\\nssh hostname is the name you specified in the ~/.ssh/config file\\nIn case of Jupyter Notebook run\\n“ssh -L 8888:localhost:8888 gcp-vm”\\nfrom your local machine’s cli.\\nNOTE: If you logout from the session, the connection would break. Also while creating the spark session notice the block's log because sometimes it fails to run at 4040 and then switches to 4041.\\n~Abhijit Chakrborty: If you are having trouble accessing localhost ports from GCP VM consider adding the forwarding instructions to .ssh/config file as following:\\n```\\nHost <hostname>\\nHostname <external-gcp-ip>\\nUser xxxx\\nIdentityFile yyyy\\nLocalForward 8888 localhost:8888\\nLocalForward 8080 localhost:8080\\nLocalForward 5432 localhost:5432\\nLocalForward 4040 localhost:4040\\n```\\nThis should automatically forward all ports and will enable accessing localhost ports.\",\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Jupyter Notebook or SparkUI not loading properly at localhost after port forwarding from VS code?'},\n",
       "  {'text': '~ Abhijit Chakraborty\\n`sdk list java`  to check for available java sdk versions.\\n`sdk install java 11.0.22-amzn`  as  java-11.0.22-amzn was available for my codespace.\\nclick on Y if prompted to change the default java version.\\nCheck for java version using `java -version `.\\nIf working fine great, else `sdk default java 11.0.22-amzn` or whatever version you have installed.',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Installing Java 11 on codespaces'},\n",
       "  {'text': 'Sometimes while creating a dataproc cluster on GCP, the following error is encountered.\\nSolution: As mentioned here, sometimes there might not be enough resources in the given region to allocate the request. Usually, gets freed up in a bit and one can create a cluster. – abhirup ghosh\\nSolution 2:  Changing the type of boot-disk from PD-Balanced to PD-Standard, in terraform, helped solve the problem.- Sundara Kumar Padmanabhan',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': \"Error: Insufficient 'SSD_TOTAL_GB' quota. Requested 500.0, available 470.0.\"},\n",
       "  {'text': \"Pyspark converts the difference of two TimestampType values to Python's native datetime.timedelta object. The timedelta object only stores the duration in terms of days, seconds, and microseconds. Each of the three units of time must be manually converted into hours in order to express the total duration between the two timestamps using only hours.\\nAnother way for achieving this is using the datediff (sql function). It receives this parameters\\nUpper Date: the closest date you have. For example dropoff_datetime\\nLower Date: the farthest date you have.  For example pickup_datetime\\nAnd the result is returned in terms of days, so you could multiply the result for 24 in order to get the hours.\",\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Homework - how to convert the time difference of two timestamps to hours'},\n",
       "  {'text': 'This version combination worked for me:\\nPySpark = 3.3.2\\nPandas = 1.5.3\\n\\nIf it still has an error,',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'PicklingError: Could not serialize object: IndexError: tuple index out of range'},\n",
       "  {'text': \"Run this before SparkSession\\nimport os\\nimport sys\\nos.environ['PYSPARK_PYTHON'] = sys.executable\\nos.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable\",\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Py4JJavaError: An error occurred while calling o180.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 6) (host.docker.internal executor driver): org.apache.spark.SparkException: Python worker failed to connect back.'},\n",
       "  {'text': \"import os\\nimport sys\\nos.environ['PYSPARK_PYTHON'] = sys.executable\\nos.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable\\nDataproc Pricing: https://cloud.google.com/dataproc/pricing#on_gke_pricing\",\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'RuntimeError: Python in worker has different version 3.11 than that in driver 3.10, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.'},\n",
       "  {'text': 'Ans: No, you can submit a job to DataProc from your local computer by installing gsutil (https://cloud.google.com/storage/docs/gsutil_install) and configuring it. Then, you can execute the following command from your local computer.\\ngcloud dataproc jobs submit pyspark \\\\\\n--cluster=de-zoomcamp-cluster \\\\\\n--region=europe-west6 \\\\\\ngs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py \\\\\\n-- \\\\\\n--input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \\\\\\n--input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \\\\\\n--output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2020 (edited)',\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'Dataproc Qn: Is it essential to have a VM on GCP for running Dataproc and submitting jobs ?'},\n",
       "  {'text': \"AttributeError: 'DataFrame' object has no attribute 'iteritems'\\nthis is because the method inside the pyspark refers to a package that has been already deprecated\\n(https://stackoverflow.com/questions/76404811/attributeerror-dataframe-object-has-no-attribute-iteritems)\\nYou can do this code below, which is mentioned in the stackoverflow link above:\\nQ: DE Zoomcamp 5.6.3 - Setting up a Dataproc Cluster I cannot create a cluster and get this message. I tried many times as the FAQ said, but it didn't work. What can I do?\\nError\\nInsufficient 'SSD_TOTAL_GB' quota. Requested 500.0, available 250.0.\\nRequest ID: 17942272465025572271\\nA: The master and worker nodes are allocated a maximum of 250 GB of memory combined. In the configuration section, adhere to the following specifications:\\nMaster Node:\\nMachine type: n2-standard-2\\nPrimary disk size: 85 GB\\nWorker Node:\\nNumber of worker nodes: 2\\nMachine type: n2-standard-2\\nPrimary disk size: 80 GB\\nYou can allocate up to 82.5 GB memory for worker nodes, keeping in mind that the total memory allocated across all nodes cannot exceed 250 GB.\",\n",
       "   'section': 'Module 5: pyspark',\n",
       "   'question': 'In module 5.3.1, trying to run spark.createDataFrame(df_pandas).show() returns error'},\n",
       "  {'text': 'The MacOS setup instruction (https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/setup/macos.md#installing-java) for setting the JAVA_HOME environment variable is for Intel-based Macs which have a default install location at /usr/local/. If you have an Apple Silicon mac, you will have to set JAVA_HOME to /opt/homebrew/, specifically in your .bashrc or .zshrc:\\nexport JAVA_HOME=\"/opt/homebrew/opt/openjdk/bin\"\\nexport PATH=\"$JAVA_HOME:$PATH\"\\nConfirm that your path was correctly set by running the command: which java\\nYou should expect to see the output:\\n/opt/homebrew/opt/openjdk/bin/java\\nReference: https://docs.brew.sh/Installation',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Setting JAVA_HOME with Homebrew on Apple Silicon'},\n",
       "  {'text': 'Check Docker Compose File:\\nEnsure that your docker-compose.yaml file is correctly configured with the necessary details for the \"control-center\" service. Check the service name, image name, ports, volumes, environment variables, and any other configurations required for the container to start.\\nOn Mac OSX 12.2.1 (Monterey) I could not start the kafka control center. I opened Docker Desktop and saw docker images still running from week 4, which I did not see when I typed “docker ps.” I deleted them in docker desktop and then had no problem starting up the kafka environment.',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Could not start docker image “control-center” from the docker-compose.yaml file.'},\n",
       "  {'text': \"Solution from Alexey: create a virtual environment and run requirements.txt and the python files in that environment.\\nTo create a virtual env and install packages (run only once)\\npython -m venv env\\nsource env/bin/activate\\npip install -r ../requirements.txt\\nTo activate it (you'll need to run it every time you need the virtual env):\\nsource env/bin/activate\\nTo deactivate it:\\ndeactivate\\nThis works on MacOS, Linux and Windows - but for Windows the path is slightly different (it's env/Scripts/activate)\\nAlso the virtual environment should be created only to run the python file. Docker images should first all be up and running.\",\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Module “kafka” not found when trying to run producer.py'},\n",
       "  {'text': 'ImportError: DLL load failed while importing cimpl: The specified module could not be found\\nVerify Python Version:\\nMake sure you are using a compatible version of Python with the Avro library. Check the Python version and compatibility requirements specified by the Avro library documentation.\\n... you may have to load librdkafka-5d2e2910.dll in the code. Add this before importing avro:\\nfrom ctypes import CDLL\\nCDLL(\"C:\\\\\\\\Users\\\\\\\\YOUR_USER_NAME\\\\\\\\anaconda3\\\\\\\\envs\\\\\\\\dtcde\\\\\\\\Lib\\\\\\\\site-packages\\\\\\\\confluent_kafka.libs\\\\librdkafka-5d2e2910.dll\")\\nIt seems that the error may occur depending on the OS and python version installed.\\nALTERNATIVE:\\nImportError: DLL load failed while importing cimpl\\n✅SOLUTION: $env:CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1 in Powershell.\\nYou need to set this DLL manually in Conda Env.\\nSource: https://githubhot.com/repo/confluentinc/confluent-kafka-python/issues/1186?page=2',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Error importing cimpl dll when running avro examples'},\n",
       "  {'text': \"✅SOLUTION: pip install confluent-kafka[avro].\\nFor some reason, Conda also doesn't include this when installing confluent-kafka via pip.\\nMore sources on Anaconda and confluent-kafka issues:\\nhttps://github.com/confluentinc/confluent-kafka-python/issues/590\\nhttps://github.com/confluentinc/confluent-kafka-python/issues/1221\\nhttps://stackoverflow.com/questions/69085157/cannot-import-producer-from-confluent-kafka\",\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': \"ModuleNotFoundError: No module named 'avro'\"},\n",
       "  {'text': 'If you get an error while running the command python3 stream.py worker\\nRun pip uninstall kafka-python\\nThen run pip install kafka-python==1.4.6\\nWhat is the use of  Redpanda ?\\nRedpanda: Redpanda is built on top of the Raft consensus algorithm and is designed as a high-performance, low-latency alternative to Kafka. It uses a log-centric architecture similar to Kafka but with different underlying principles.\\nRedpanda is a powerful, yet simple, and cost-efficient streaming data platform that is compatible with Kafka® APIs while eliminating Kafka complexity.',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Error while running python3 stream.py worker'},\n",
       "  {'text': 'Got this error because the docker container memory was exhausted. The dta file was upto 800MB but my docker container does not have enough memory to handle that.\\nSolution was to load the file in chunks with Pandas, then create multiple parquet files for each dat file I was processing. This worked smoothly and the issue was resolved.',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Negsignal:SIGKILL while converting dta files to parquet format'},\n",
       "  {'text': 'Copy the file found in the Java example: data-engineering-zoomcamp/week_6_stream_processing/java/kafka_examples/src/main/resources/rides.csv',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'data-engineering-zoomcamp/week_6_stream_processing/python/resources/rides.csv is missing'},\n",
       "  {'text': 'tip:As the videos have low audio so I downloaded them and used VLC media player with putting the audio to the max 200% of original audio and the audio became quite good or try to use auto caption generated on Youtube directly.\\nKafka Python Videos - Rides.csv\\nThere is no clear explanation of the rides.csv data that the producer.py python programs use. You can find that here https://raw.githubusercontent.com/DataTalksClub/data-engineering-zoomcamp/2bd33e89906181e424f7b12a299b70b19b7cfcd5/week_6_stream_processing/python/resources/rides.csv.',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Kafka- python videos have low audio and hard to follow up'},\n",
       "  {'text': 'If you have this error, it most likely that your kafka broker docker container is not working.\\nUse docker ps to confirm\\nThen in the docker compose yaml file folder, run docker compose up -d to start all the instances.',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'kafka.errors.NoBrokersAvailable: NoBrokersAvailable'},\n",
       "  {'text': 'Ankush said we can focus on horizontal scaling option.\\n“think of scaling in terms of scaling from consumer end. Or consuming message via horizontal scaling”',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Kafka homwork Q3, there are options that support scaling concept more than the others:'},\n",
       "  {'text': 'If you get this error, know that you have not built your sparks and juypter images. This images aren’t readily available on dockerHub.\\nIn the spark folder, run ./build.sh from a bash cli to to build all images before running docker compose',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': \"How to fix docker compose error: Error response from daemon: pull access denied for spark-3.3.1, repository does not exist or may require 'docker login': denied: requested access to the resource is denied\"},\n",
       "  {'text': 'Run this command in terminal in the same directory (/docker/spark):\\nchmod +x build.sh',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Python Kafka: ./build.sh: Permission denied Error'},\n",
       "  {'text': 'Restarting all services worked for me:\\ndocker-compose down\\ndocker-compose up',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Python Kafka: ‘KafkaTimeoutError: Failed to update metadata after 60.0 secs.’ when running stream-example/producer.py'},\n",
       "  {'text': 'While following tutorial 13.2 , when running ./spark-submit.sh streaming.py, encountered the following error:\\n…\\n24/03/11 09:48:36 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://localhost:7077...\\n24/03/11 09:48:36 INFO TransportClientFactory: Successfully created connection to localhost/127.0.0.1:7077 after 10 ms (0 ms spent in bootstraps)\\n24/03/11 09:48:54 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors\\n24/03/11 09:48:56 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://localhost:7077…\\n24/03/11 09:49:16 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://localhost:7077...\\n24/03/11 09:49:36 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.\\n24/03/11 09:49:36 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.\\n…\\npy4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.SparkSession.\\n: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.\\n…\\nSolution:\\nDowngrade your local PySpark to 3.3.1 (same as Dockerfile)\\nThe reason for the failed connection in my case was the mismatch of PySpark versions. You can see that from the logs of spark-master in the docker container.\\nSolution 2:\\nCheck what Spark version your local machine has\\npyspark –version\\nspark-submit –version\\nAdd your version to SPARK_VERSION in build.sh',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Python Kafka: ./spark-submit.sh streaming.py - ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.'},\n",
       "  {'text': 'Start a new terminal\\nRun: docker ps\\nCopy the CONTAINER ID of the spark-master container\\nRun: docker exec -it <spark_master_container_id> bash\\nRun: cat logs/spark-master.out\\nCheck for the log when the error happened\\nGoogle the error message from there',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Python Kafka: ./spark-submit.sh streaming.py - How to check why Spark master connection fails'},\n",
       "  {'text': 'Make sure your java version is 11 or 8.\\nCheck your version by:\\njava --version\\nCheck all your versions by:\\n/usr/libexec/java_home -V\\nIf you already have got java 11 but just not selected as default, select the specific version by:\\nexport JAVA_HOME=$(/usr/libexec/java_home -v 11.0.22)\\n(or other version of 11)',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Python Kafka: ./spark-submit.sh streaming.py Error: py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.'},\n",
       "  {'text': 'In my set up, all of the dependencies listed in gradle.build were not installed in <project_name>-1.0-SNAPSHOT.jar.\\nSolution:\\nIn build.gradle file, I added the following at the end:\\nshadowJar {\\narchiveBaseName = \"java-kafka-rides\"\\narchiveClassifier = \\'\\'\\n}\\nAnd then in the command line ran ‘gradle shadowjar’, and run the script from java-kafka-rides-1.0-SNAPSHOT.jar created by the shadowjar',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Java Kafka: <project_name>-1.0-SNAPSHOT.jar errors: package xxx does not exist even after gradle build'},\n",
       "  {'text': 'confluent-kafka: `pip install confluent-kafka` or `conda install conda-forge::python-confluent-kafka`\\nfastavro: pip install fastavro\\nAbhirup Ghosh\\nCan install Faust Library for Module 6 Python Version due to dependency conflicts?\\nThe Faust repository and library is no longer maintained - https://github.com/robinhood/faust\\nIf you do not know Java, you now have the option to follow the Python Videos 6.13 & 6.14 here https://www.youtube.com/watch?v=BgAlVknDFlQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=80  and follow the RedPanda Python version here https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/06-streaming/python/redpanda_example - NOTE: I highly recommend watching the Java videos to understand the concept of streaming but you can skip the coding parts - all will become clear when you get to the Python videos and RedPanda files.',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Python Kafka: Installing dependencies for python3 06-streaming/python/avro_example/producer.py'},\n",
       "  {'text': 'In the project directory, run:\\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Java Kafka: How to run producer/consumer/kstreams/etc in terminal'},\n",
       "  {'text': 'For example, when running JsonConsumer.java, got:\\nConsuming form kafka started\\nRESULTS:::0\\nRESULTS:::0\\nRESULTS:::0\\nOr when running JsonProducer.java, got:\\nException in thread \"main\" java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.SaslAuthenticationException: Authentication failed\\nSolution:\\nMake sure in the scripts in src/main/java/org/example/ that you are running (e.g. JsonConsumer.java, JsonProducer.java), the StreamsConfig.BOOTSTRAP_SERVERS_CONFIG is the correct server url (e.g. europe-west3 from example vs europe-west2)\\nMake sure cluster key and secrets are updated in src/main/java/org/example/Secrets.java (KAFKA_CLUSTER_KEY and KAFKA_CLUSTER_SECRET)',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Java Kafka: When running the producer/consumer/etc java scripts, no results retrieved or no message sent'},\n",
       "  {'text': 'Situation: in VS Code, usually there will be a triangle icon next to each test. I couldn’t see it at first and had to do some fixes.\\nSolution:\\n(Source)\\nVS Code\\n→ Explorer (first icon on the left navigation bar)\\n→ JAVA PROJECTS (bottom collapsable)\\n→  icon next in the rightmost position to JAVA PROJECTS\\n→  clean Workspace\\n→ Confirm by clicking Reload and Delete\\nNow you will be able to see the triangle icon next to each test like what you normally see in python tests.\\nE.g.:\\nYou can also add classes and packages in this window instead of creating files in the project directory',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Java Kafka: Tests are not picked up in VSCode'},\n",
       "  {'text': 'In Confluent Cloud:\\nEnvironment → default (or whatever you named your environment as) → The right navigation bar →  “Stream Governance API” →  The URL under “Endpoint”\\nAnd create credentials from Credentials section below it',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'Confluent Kafka: Where can I find schema registry URL?'},\n",
       "  {'text': 'You can check the version of your local spark using spark-submit --version. In the build.sh file of the Python folder, make sure that SPARK_VERSION matches your local version. Similarly, make sure the pyspark you pip installed also matches this version.',\n",
       "   'section': 'Module 6: streaming with kafka',\n",
       "   'question': 'How do I check compatibility of local and container Spark versions?'},\n",
       "  {'text': 'According to https://github.com/dpkp/kafka-python/\\n“DUE TO ISSUES WITH RELEASES, IT IS SUGGESTED TO USE https://github.com/wbarnha/kafka-python-ng FOR THE TIME BEING”\\nUse pip install kafka-python-ng instead',\n",
       "   'section': 'Project',\n",
       "   'question': 'How to fix the error \"ModuleNotFoundError: No module named \\'kafka.vendor.six.moves\\'\"?'},\n",
       "  {'text': 'Each submitted project will be evaluated by 3 (three) randomly assigned students that have also submitted the project.\\nYou will also be responsible for grading the projects from 3 fellow students yourself. Please be aware that: not complying to this rule also implies you failing to achieve the Certificate at the end of the course.\\nThe final grade you get will be the median score of the grades you get from the peer reviewers.\\nAnd of course, the peer review criteria for evaluating or being evaluated must follow the guidelines defined here.',\n",
       "   'section': 'Project',\n",
       "   'question': 'How is my capstone project going to be evaluated?'},\n",
       "  {'text': 'There is only ONE project for this Zoomcamp. You do not need to submit or create two projects. There are simply TWO chances to pass the course. You can use the Second Attempt if you a) fail the first attempt b) do not have the time due to other engagements such as holiday or sickness etc. to enter your project into the first attempt.',\n",
       "   'section': 'Project',\n",
       "   'question': 'Project 1 & Project 2'},\n",
       "  {'text': 'See a list of datasets here: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_7_project/datasets.md',\n",
       "   'section': 'Project',\n",
       "   'question': 'Does anyone know nice and relatively large datasets?'},\n",
       "  {'text': 'You need to redefine the python environment variable to that of your user account',\n",
       "   'section': 'Project',\n",
       "   'question': 'How to run python as start up script?'},\n",
       "  {'text': 'Initiate a Spark Session\\nspark = (SparkSession\\n.builder\\n.appName(app_name)\\n.master(master=master)\\n.getOrCreate())\\nspark.streams.resetTerminated()\\nquery1 = spark\\n.readStream\\n…\\n…\\n.load()\\nquery2 = spark\\n.readStream\\n…\\n…\\n.load()\\nquery3 = spark\\n.readStream\\n…\\n…\\n.load()\\nquery1.start()\\nquery2.start()\\nquery3.start()\\nspark.streams.awaitAnyTermination() #waits for any one of the query to receive kill signal or error failure. This is asynchronous\\n# On the contrary query3.start().awaitTermination() is a blocking ex call. Works well when we are reading only from one topic.',\n",
       "   'section': 'Project',\n",
       "   'question': 'Spark Streaming - How do I read from multiple topics in the same Spark Session'},\n",
       "  {'text': 'Transformed data can be moved in to azure blob storage and then it can be moved in to azure SQL DB, instead of moving directly from databricks to Azure SQL DB.',\n",
       "   'section': 'Project',\n",
       "   'question': 'Data Transformation from Databricks to Azure SQL DB'},\n",
       "  {'text': 'The trial dbt account provides access to dbt API. Job will still be needed to be added manually. Airflow will run the job using a python operator calling the API. You will need to provide api key, job id, etc. (be careful not committing it to Github).\\nDetailed explanation here: https://docs.getdbt.com/blog/dbt-airflow-spiritual-alignment\\nSource code example here: https://github.com/sungchun12/airflow-toolkit/blob/95d40ac76122de337e1b1cdc8eed35ba1c3051ed/dags/examples/dbt_cloud_example.py',\n",
       "   'section': 'Project',\n",
       "   'question': 'Orchestrating dbt with Airflow'},\n",
       "  {'text': 'https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index.html\\nhttps://airflow.apache.org/docs/apache-airflow-providers-google/stable/_modules/airflow/providers/google/cloud/operators/dataproc.html\\nGive the following roles to you service account:\\nDataProc Administrator\\nService Account User (explanation here)\\nUse DataprocSubmitPySparkJobOperator, DataprocDeleteClusterOperator and  DataprocCreateClusterOperator.\\nWhen using  DataprocSubmitPySparkJobOperator, do not forget to add:\\ndataproc_jars = [\"gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.24.0.jar\"]\\nBecause DataProc does not already have the BigQuery Connector.',\n",
       "   'section': 'Project',\n",
       "   'question': 'Orchestrating DataProc with Airflow'},\n",
       "  {'text': 'You can trigger your dbt job in Mage pipeline. For this get your dbt cloud api key under settings/Api tokens/personal tokens. Add it safely to  your .env\\nFor example\\ndbt_api_trigger=dbt_**\\nNavigate to job page and find api trigger  link\\nThen create a custom mage Python block with a simple http request like here\\nfrom dotenv import load_dotenv\\nfrom pathlib import Path\\ndotenv_path = Path(\\'/home/src/.env\\')\\nload_dotenv(dotenv_path=dotenv_path)\\ndbt_api_trigger= os.getenv(dbt_api_trigger)\\nurl = f\"https://cloud.getdbt.com/api/v2/accounts/{dbt_account_id}/jobs/<job_id>/run/\"\\nheaders = {\\n        \"Authorization\": f\"Token {dbt_api_trigger}\",\\n        \"Content-Type\": \"application/json\" }\\nbody = {\\n        \"cause\": \"Triggered via API\"\\n    }\\n    response = requests.post(url, headers=headers, json=body)\\nvoila! You triggered dbt job form your mage pipeline.',\n",
       "   'section': 'Project',\n",
       "   'question': 'Orchestrating dbt cloud with Mage'},\n",
       "  {'text': \"The slack thread : thttps://datatalks-club.slack.com/archives/C01FABYF2RG/p1677678161866999\\nThe question is that sometimes even if you take plenty of effort to document every single step, and we can't even sure if the person doing the peer review will be able to follow-up, so how this criteria will be evaluated?\\nAlex clarifies: “Ideally yes, you should try to re-run everything. But I understand that not everyone has time to do it, so if you check the code by looking at it and try to spot errors, places with missing instructions and so on - then it's already great”\",\n",
       "   'section': 'Project',\n",
       "   'question': 'Project evaluation - Reproducibility'},\n",
       "  {'text': 'The key valut in Azure cloud is used to store credentials or passwords or secrets of different tech stack used in Azure. For example if u do not want to expose the password in SQL database, then we can save the password under a given name and use them in other Azure stack.',\n",
       "   'section': 'Project',\n",
       "   'question': 'Key Vault in Azure cloud stack'},\n",
       "  {'text': 'You can get the version of py4j from inside docker using this command\\ndocker exec -it --user airflow airflow-airflow-scheduler-1 bash -c \"ls /opt/spark/python/lib\"',\n",
       "   'section': 'Project',\n",
       "   'question': \"Spark docker - `ModuleNotFoundError: No module named 'py4j'` while executing `import pyspark`\"},\n",
       "  {'text': 'Either use conda or pip for managing venv, using both of them together will cause incompatibility.\\nIf you’re using conda, install psycopg2 using the conda-forge channel, which may handle the architecture compatibility automatically\\nconda install -c conda-forge psycopg2\\nIf pip, do the normal install\\npip install psycopg2',\n",
       "   'section': 'Project',\n",
       "   'question': 'psycopg2 complains of incompatible environment e.g x86 instead of amd'},\n",
       "  {'text': 'This is not a FAQ but more of an advice if you want to set up dbt locally, I did it in the following way:\\nI had the postgres instance from week 2 (year 2024) up (the docker-compose)\\nmkdir dbt\\nvi dbt/profiles.yml\\nAnd here I attached this content (only the required fields) and replaced them with the proper values (for instance mine where in the .env file of the folder of week 2 docker stuff)\\ncd dbt && git clone https://github.com/dbt-labs/dbt-starter-project\\nmkdir project && cd project && mv dbt-starter-project/* .\\nMake sure that you align the profile name in profiles.yml with the dbt_project.yml file\\nAdd this line anywhere on the dbt_project.yml file:\\nconfig-version: 2\\ndocker run --network=mage-zoomcamp_default --mount type=bind,source=/<your-path>/dbt/project,target=/usr/app --mount type=bind,source=/<your-path>/profiles.yml,target=/root/.dbt/profiles.yml ghcr.io/dbt-labs/dbt-postgres ls\\nIf you have trouble run\\ndocker run --network=mage-zoomcamp_default --mount type=bind,source=/<your-path>/dbt/project,target=/usr/app --mount type=bind,source=/<your-path>/profiles.yml,target=/root/.dbt/profiles.yml ghcr.io/dbt-labs/dbt-postgres debug',\n",
       "   'section': 'Project',\n",
       "   'question': 'Setting up dbt locally with Docker and Postgres'},\n",
       "  {'text': 'The following line should be included in pyspark configuration\\n# Example initialization of SparkSession variable\\nspark = (SparkSession.builder\\n.master(...)\\n.appName(...)\\n# Add the following configuration\\n.config(\"spark.jars.packages\", \"com.google.cloud.spark:spark-3.5-bigquery:0.37.0\")\\n)',\n",
       "   'section': 'Project',\n",
       "   'question': 'How to connect Pyspark with BigQuery?'},\n",
       "  {'text': 'Install the astronomer-cosmos package as a dependency. (see Terraform example).\\nMake a new folder, dbt/, inside the dags/ folder of your Composer GCP bucket and copy paste your dbt-core project there. (see example)\\nEnsure your profiles.yml is configured to authenticate with a service account key. (see BigQuery example)\\nCreate a new DAG using the DbtTaskGroup class and a ProfileConfig specifying a profiles_yml_filepath that points to the location of your JSON key file. (see example)\\nYour dbt lineage graph should now appear as tasks inside a task group like this:',\n",
       "   'section': 'Course Management Form for Homeworks',\n",
       "   'question': 'How to run a dbt-core project as an Airflow Task Group on Google Cloud Composer using a service account JSON key'},\n",
       "  {'text': 'The display name listed on the leaderboard is an auto-generated randomized name. You can edit it to be a nickname, or your real name, if you prefer. Your entry on the Leaderboard is the one highlighted in teal(?) / light green (?).\\nThe Certificate name should be your actual name that you want to appear on your certificate after completing the course.\\nThe \"Display on Leaderboard\" option indicates whether you want your name to be listed on the course leaderboard.\\nQuestion: Is it possible to create external tables in BigQuery using URLs, such as those from the NY Taxi data website?\\nAnswer: Not really, only Bigtable, Cloud Storage, and Google Drive are supported data stores.',\n",
       "   'section': 'Workshop 1 - dlthub',\n",
       "   'question': 'Edit Course Profile.'},\n",
       "  {'text': \"Answer: To run the provided code, ensure that the 'dlt[duckdb]' package is installed. You can do this by executing the provided installation command: !pip install dlt[duckdb]. If you’re doing it locally, be sure to also have duckdb pip installed (even before the duckdb package is loaded).\",\n",
       "   'section': 'Workshop 1 - dlthub',\n",
       "   'question': 'How do I install the necessary dependencies to run the code?'},\n",
       "  {'text': 'If you are running Jupyter Notebook on a fresh new Codespace or in local machine with a new Virtual Environment, you will need this package to run the starter Jupyter Notebook offered by the teacher. Execute this:\\npip install jupyter',\n",
       "   'section': 'Workshop 1 - dlthub',\n",
       "   'question': 'Other packages needed but not listed'},\n",
       "  {'text': 'Alternatively, you can switch to in-file storage with:',\n",
       "   'section': 'Workshop 1 - dlthub',\n",
       "   'question': 'How can I use DuckDB In-Memory database with dlt ?'},\n",
       "  {'text': 'After loading, you should have a total of 8 records, and ID 3 should have age 33\\nQuestion: Calculate the sum of ages of all the people loaded as described above\\nThe sum of all eight records\\' respective ages is too big to be in the choices. You need to first filter out the people whose occupation is equal to None in order to get an answer that is close to or present in the given choices. 😃\\n----------------------------------------------------------------------------------------\\nFIXED = use a raw string and keep the file:/// at the start of your file path\\nI\\'m having an issue with the dlt workshop notebook. The \\'Load to Parquet file\\' section specifically. No matter what I change the file path to, it\\'s still saving the dlt files directly to my C drive.\\n# Set the bucket_url. We can also use a local folder\\nos.environ[\\'DESTINATION__FILESYSTEM__BUCKET_URL\\'] = r\\'file:///content/.dlt/my_folder\\'\\nurl = \"https://storage.googleapis.com/dtc_zoomcamp_api/yellow_tripdata_2009-06.jsonl\"\\n# Define your pipeline\\npipeline = dlt.pipeline(\\npipeline_name=\\'my_pipeline\\',\\ndestination=\\'filesystem\\',\\ndataset_name=\\'mydata\\'\\n)\\n# Run the pipeline with the generator we created earlier.\\nload_info = pipeline.run(stream_download_jsonl(url), table_name=\"users\", loader_file_format=\"parquet\")\\nprint(load_info)\\n# Get a list of all Parquet files in the specified folder\\nparquet_files = glob.glob(\\'/content/.dlt/my_folder/mydata/users/*.parquet\\')\\n# show parquet files\\nfor file in parquet_files:\\nprint(file)',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Homework - dlt Exercise 3 - Merge a generator concerns'},\n",
       "  {'text': 'Check the contents of the repository with ls - the command.sh file should be in the root folder\\nIf it is not, verify that you had cloned the correct repository - https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'command.sh Error - source: no such file or directory: command.sh'},\n",
       "  {'text': \"psql is a command line tool that is installed alongside PostgreSQL DB, but since we've always been running PostgreSQL in a container, you've only got `pgcli`, which lacks the feature to run a sql script into the DB. Besides, having a command line for each database flavor you'll have to deal with as a Data Professional is far from ideal.\\nSo, instead, you can use usql. Check the docs for details on how to install for your OS. On macOS, it supports `homebrew`, and on Windows, it supports scoop.\\nSo, to run the taxi_trips.sql script with usql:\",\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'psql - command not found: psql (alternative install)'},\n",
       "  {'text': 'If you encounter this error and are certain that you have docker compose installed, but typically run it as docker compose without the hyphen, then consider editing command.sh file by removing the hyphen from ‘docker-compose’. Example:\\nstart-cluster() {\\ndocker compose -f docker/docker-compose.yml up -d\\n}',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Setup - source command.sh - error: “docker-compose” not found'},\n",
       "  {'text': 'ERROR: The Compose file \\'./docker/docker-compose.yml\\' is invalid because:\\nInvalid top-level property \"x-image\". Valid top-level sections for this Compose file are: version, services, networks, volumes, secrets, configs, and extensions starting with \"x-\".\\nYou might be seeing this error because you\\'re using the wrong Compose file version. Either specify a supported version (e.g \"2.2\" or \"3.3\") and place your service definitions under the `services` key, or omit the `version` key and place your service definitions at the root of the file to use version 1.\\nFor more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/\\nIf you encounter the above error and have docker-compose installed, try updating your version of docker-compose. At the time of reporting this issue (March 17 2024), Ubuntu does not seem to support a docker-compose version high enough to run the required docker images. If you have this error and are on a Ubuntu machine, consider starting a VM with a Debian machine or look for an alternative way to download docker-compose at the latest version on your machine.',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Setup - start-cluster error: Invalid top-level property x-image'},\n",
       "  {'text': 'Ans: [source] Yes, it is so that we can observe the changes as we’re working on the queries in real-time. The script is changing the date timestamp to the current time, so our queries with the now()filter would work. Open another terminal tab to copy+paste the queries while the stream-kafka script is running in the background.\\nNoel: I have recently increased this up to 100 at a time, you may pull the latest changes from the repository.',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'stream-kafka Qn: Is it expected that the records are being ingested 10 at a time?'},\n",
       "  {'text': 'Ans: No, it is not.',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Setup - Qn: Is kafka install required for the RisingWave workshop? [source]'},\n",
       "  {'text': 'Ans: about 7GB free for all the containers to be provisioned and then the psql still needs to run and ingest the taxi data, so maybe 10gb in total?',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Setup - Qn: How much free disk space should we have? [source]'},\n",
       "  {'text': 'Replace psycopg2==2.9.9 with psycopg2-binary in the requirements.txt file [source] [another]\\nWhen you open another terminal to run the psql, remember to do the source command.sh step for each terminal session\\n---------------------------------------------------------------------------------------------',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Psycopg2 - issues when running stream-kafka script'},\n",
       "  {'text': \"If you’re using an Anaconda installation:\\nCd home/\\nConda install gcc\\nSource back to your RisingWave Venv - source .venv/bin/activate\\nPip install psycopg2-binary\\nPip install -r requirements.txt\\nFor some reason this worked - the Conda base doesn’t have the GCC installed - (GNU Compiler Collection) a compiler system that supports various programming languages. Without this the it fails to install pyproject.toml-based projects\\n“It's possible that in your specific environment, the gcc installation was required at the system level rather than within the virtual environment. This can happen if the build process for psycopg2 tries to access system-level dependencies during installation.\\nInstalling gcc in your main Python installation (Conda) would make it available system-wide, allowing any Python environment to access it when necessary for building packages.”\\ngcc stands for GNU Compiler Collection. It is a compiler system developed by the GNU Project that supports various programming languages, including C, C++, Objective-C, and Fortran.\\nGCC is widely used for compiling source code written in these languages into executable programs or libraries. It's a key tool in the software development process, particularly in the compilation stage where source code is translated into machine code that can be executed by a computer's processor.\\nIn addition to compiling source code, GCC also provides various optimization options, debugging support, and extensive documentation, making it a powerful and versatile tool for developers across different platforms and architectures.\\n—-----------------------------------------------------------------------------------\",\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Psycopg2 - `Could not build wheels for psycopg2, which is required to install pyproject.toml-based projects`'},\n",
       "  {'text': \"Below I have listed some steps I took to rectify this and potentially other minor errors, in Windows:\\nUse the git bash terminal in windows.\\nActivate python venv from git bash: source .venv/Scripts/activate\\nModify the seed_kafka.py file: in the first line, replace python3 with python.\\nNow from git bash, run the seed-kafka cmd. It should work now.\\nAdditional Notes:\\nYou can connect to the RisingWave cluster from Powershell with the command psql -h localhost -p 4566 -d dev -U root , otherwise it asks for a password.\\nThe equivalent of source commands.sh  in Powershell is . .\\\\commands.sh from the workshop directory.\\nHope this can save you from some trouble in case you're doing this workshop on Windows like I am.\\n—--------------------------------------------------------------------------------------\",\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Psycopg2 InternalError: Failed to run the query - when running the seed-kafka command after initial setup.'},\n",
       "  {'text': 'In case the script gets stuck on\\n%3|1709652240.100|FAIL|rdkafka#producer-2| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused (after 0ms in state CONNECT)gre\\nafter trying to load the trip data, check the logs of the message_queue container in docker. If it keeps restarting with Could not initialize seastar: std::runtime_error (insufficient physical memory: needed 4294967296 available 4067422208)  as the last message, then go to the docker-compose file in the docker folder of the project and change the ‘memory’ command for the message_queue service for some lower value.\\nSolution: lower the memory allocation of the service “message_queue” in your docker-compose file from 4GB. If you have the “insufficient physical memory” error message (try 3GB)\\nIssue: Running psql -f risingwave-sql/table/trip_data.sql after starting services with ‘default’ values using docker-compose up gives the error  “psql:risingwave-sql/table/trip_data.sql:61: ERROR:  syntax error at or near \".\" LINE 60:       properties.bootstrap.server=\\'message_queue:29092\\'”\\nSolution: Make sure you have run source commands.sh in each terminal window',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Running stream-kafka script gets stuck on a loop with Connection Refused'},\n",
       "  {'text': 'Use seed-kafka instead of stream-kafka to get a static set of results.',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'For the homework questions is there a specific number of records that have to be processed to obtain the final answer?'},\n",
       "  {'text': 'It is best to use the order by and limit clause on the query to the materialized view instead of the materialized view creation in order to guarantee consistent results\\nHomework - The answers in the homework do not match the provided options: You must follow the following steps: 1. clean-cluster 2. docker volume prune and use seed-kafka instead of stream-kafka. Ensure that the number of records is 100K.',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Homework - Materialized view does not guarantee order by warning'},\n",
       "  {'text': 'For this workshop, and if you are following the view from Noel (2024) this requires you to install postgres to use it on your terminal. Found this steps (commands) to get it done [source]:\\nwget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -\\nsudo sh -c \\'echo \"deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main\" >> /etc/apt/sources.list.d/pgdg.list\\'\\nsudo apt update\\napt install postgresql postgresql-contrib\\n(comment): now let’s check the service for postgresql\\nservice postgresql status\\n(comment) If down: use the next command\\nservice postgresql start\\n(comment) And your are done',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'How to install postgress on Linux like OS'},\n",
       "  {'text': 'Refer to the solution given in the first solution here:\\nhttps://stackoverflow.com/questions/24683221/xdg-open-no-method-available-even-after-installing-xdg-utils\\nInstead of w3m use any other browser of your choice.\\nIt is just trying to open the index.html file. Which you can do from your File Explorer/Finder. If you’re on wsl try using explorer.exe index.html',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Unable to Open Dashboard as xdg-open doesn’t open any browser'},\n",
       "  {'text': 'Example Error:\\nWhen attempting to execute a Python script named seed-kafka.py or server.py with the following shebang line specifying Python 3 as the interpreter:\\nUsers may encounter the following error in a Unix-like environment:\\nThis error indicates that there is a problem with the Python interpreter path specified in the shebang line. The presence of the \\\\r character suggests that the script was edited or created in a Windows environment, causing the interpreter path to be incorrect when executed in Unix-like environments.\\n2 Solutions:\\nEither one or the other\\nUpdate Shebang Line:\\nVerify Python Interpreter Path: Use the which python3 command to determine the path to the Python 3 interpreter available in the current environment.\\nUpdate Shebang Line: Open the script file in a text editor. Modify the shebang line to point to the correct Python interpreter path found in the previous step. Ensure that the shebang line is consistent with the Python interpreter path in the execution environment.\\nExample Shebang Line:\\nReplace /usr/bin/env python3 with the correct Python interpreter path found using which python3.\\nConvert Line Endings:\\nUse the dos2unix command-line tool to convert the line endings of the script from Windows-style to Unix-style.\\nThis removes the extraneous carriage return characters (\\\\r), resolving issues related to unexpected tokens and ensuring compatibility with Unix-like environments.\\nExample Command:',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'Resolving Python Interpreter Path Inconsistencies in Unix-like Environments'},\n",
       "  {'text': 'Ans : Windowing in streaming SQL involves defining a time-based or row-based boundary for data processing. It allows you to analyze and aggregate data over specific time intervals or based on the number of events received, providing a way to manage and organize streaming data for analysis.',\n",
       "   'section': 'Workshop 2 - RisingWave',\n",
       "   'question': 'How does windowing work in Sql?'},\n",
       "  {'text': 'Python 3.12.1, is not compatible with kafka-python-2.0.2. Therefore, instead of running \"pip install kafka-python\", you can resolve the issue by using \"pip install git+https://github.com/dpkp/kafka-python.git\". If you have already installed kafka-python, you need to run \"pip uninstall kafka-python\" before executing \"pip install git+https://github.com/dpkp/kafka-python.git\" to resolve the compatibility issue.\\nQ:In the Mage pipeline, individual blocks run successfully. However, when executing the pipeline as a whole, some blocks fail.\\nA: I have the following key-value pair in io_config.yaml file configured but still Mage blocks failed to generate OAuth and authenticate with GCP: GOOGLE_SERVICE_ACC_KEY_FILEPATH: \"{{ env_var(\\'GCP_CREDENTIALS\\') }}\". The GCP_CREDENTIALS variable holds the full path to the service account key\\'s JSON file. Adding the following line within the failed code block resolved the issue: os.environ[\\'GOOGLE_APPLICATION_CREDENTIALS\\'] = os.environ.get(\\'GCP_CREDENTIALS\\').\\nThis occurs because the path to profiles.yml is not correctly specified. You can rectify this by:\\n“export DBT_PROFILES_DBT=path/to/profiles.yml”\\nEg., /home/src/magic-zoomcamp/dbt/project_name/\\nDo the similar for DBT_PROJECT_DIR if getting similar issue with dbt_project.yml.\\nOnce DIRs are set,:\\n“dbt debug –config-dir”\\nThis would update your paths. To maintain same path across sessions, use the path variables in your .env file.\\nTo add triggers in mage pipelines via CLI, you can create a trigger of type API, and copy the API links.\\nEg. 
link: http://localhost:6789/api/pipeline_schedules/10/pipeline_runs/f3a1a4228fc64cfd85295b668c93f3b2\\nThen create a trigger.py as such:\\nimport os\\nimport requests\\nclass MageTrigger:\\nOPTIONS = {\\n\"<pipeline_name>\": {\\n\"trigger_id\": 10,\\n\"key\": \"f3a1a4228fc64cfd85295b668c93f3b2\"\\n}\\n}\\n@staticmethod\\ndef trigger_pipeline(pipeline_name, variables=None):\\ntrigger_id = MageTrigger.OPTIONS[pipeline_name][\"trigger_id\"]\\nkey = MageTrigger.OPTIONS[pipeline_name][\"key\"]\\nendpoint = f\"http://localhost:6789/api/pipeline_schedules/{trigger_id}/pipeline_runs/{key}\"\\nheaders = {\\'Content-Type\\': \\'application/json\\'}\\npayload = {}\\nif variables is not None:\\npayload[\\'pipeline_run\\'] = {\\'variables\\': variables}\\nresponse = requests.post(endpoint, headers=headers, json=payload)\\nreturn response\\nMageTrigger.trigger_pipeline(\"<pipeline_name>\")\\nFinally, after the mage server is up an running, simply this command:\\npython trigger.py from mage directory in terminal.\\nCan I do data partitioning & clustering run by dbt pipeline, or I would need to do this manually in BigQuery afterwards?\\nYou can use this configuration in your DBT model:\\n{\\n\"field\": \"<field name>\",\\n\"data_type\": \"<timestamp | date | datetime | int64>\",\\n\"granularity\": \"<hour | day | month | year>\"\\n# Only required if data_type is \"int64\"\\n\"range\": {\\n\"start\": <int>,\\n\"end\": <int>,\\n\"interval\": <int>\\n}\\n}\\nand for clustering\\n{{\\nconfig(\\nmaterialized = \"table\",\\ncluster_by = \"order_id\",\\n)\\n}}\\nmore details in: https://docs.getdbt.com/reference/resource-configs/bigquery-configs',\n",
       "   'section': 'Triggers in Mage via CLI',\n",
       "   'question': 'Encountering the error \"ModuleNotFoundError: No module named \\'kafka.vendor.six.moves\\'\" when running \"from kafka import KafkaProducer\" in Jupyter Notebook for Module 6 Homework?'},\n",
       "  {'text': 'Docker Commands\\n# Create a Docker Image from a base image\\nDocker run -it ubuntu bash\\n#List docker images\\nDocker images list\\n#List  Running containers\\nDocker ps -a\\n#List with full container ids\\nDocker ps -a --no-trunc\\n#Add onto existing image to create new image\\nDocker commit -a <User_Name> -m \"Message\" container_id New_Image_Name\\n# Create a Docker Image with an entrypoint from a base image\\nDocker run -it --entry_point=bash python:3.11\\n#Attach to a stopped container\\nDocker start -ai <Container_Name>\\n#Attach to a running container\\ndocker exec -it <Container_ID> bash\\n#copying from host to container\\nDocker cp <SRC_PATH/file> <containerid>:<dest_path>\\n#copying from container to host\\nDocker cp <containerid>:<Srct_path> <Dest Path on host/file>\\n#Create an image from a docker file\\nDocker build -t <Image_Name> <Location of Dockerfile>\\n#DockerFile Options and best practices\\nhttps://devopscube.com/build-docker-image/\\n#Docker delete all images forcefully\\ndocker rmi -f $(docker images -aq)\\n#Docker delete all containers forcefully\\ndocker rm -f $(docker ps -qa)\\n#docker compose creation\\nhttps://www.composerize.com/\\nGCP Commands\\n1.     Create SSH Keys\\n2.     Added to the Settings of Compute Engine VM Instance\\n3.     SSH-ed into the VM Instance with a config similar to following\\nHost my-website.com\\nHostName my-website.com\\nUser my-user\\nIdentityFile ~/.ssh/id_rsa\\n4.     Installed Anaconda by installing the sh file through bash <Anaconda.sh>\\n5.     Install Docker after\\na.     Sudo apt-get update\\nb.     Sudo apt-get docker\\n6.     To run Docker without SUDO permissions\\na.     https://github.com/sindresorhus/guides/blob/main/docker-without-sudo.md\\n7.     Google cloud remote copy\\na.     
gcloud compute scp LOCAL_FILE_PATHVM_NAME:REMOTE_DIR\\nInstall GCP Cloud SDK on Docker Machine\\nhttps://stackoverflow.com/questions/23247943/trouble-installing-google-cloud-sdk-in-ubuntu\\nsudo apt-get install apt-transport-https ca-certificates gnupg && echo \"deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main\"| sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list&& curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && sudo apt-get update && sudo apt-get install google-cloud-sdk && sudo apt-get install google-cloud-sdk-app-engine-java && sudo apt-get install google-cloud-sdk-app-engine-python && gcloud init\\nAnaconda Commands\\n#Activate environment\\nConda Activate <environment_name>\\n#DeActivate environment\\nConda DeActivate <environment_name>\\n#Start iterm without conda environment\\nconda config --set auto_activate_base false\\n# Using Conda forge as default (Community driven packaging recipes and solutions)\\nhttps://conda-forge.org/docs/user/introduction.html\\nconda --version\\nconda update conda\\nconda config --add channels conda-forge\\nconda config --set channel_priority strict\\n#Using Libmamba as Solver\\nconda install pgcli  --solver=libmamba\\nLinux/MAC Commands\\nStarting and Stopping Services on Linux\\n●  \\tsudo systemctl start postgresql\\n●  \\tsudo systemctl stop postgresql\\nStarting and Stopping Services on MAC\\n●      launchctl start postgresql\\n●      launchctl stop postgresql\\nIdentifying processes listening to a Port across MAC/Linux\\nsudo lsof -i -P -n | grep LISTEN\\n$ sudo netstat -tulpn | grep LISTEN\\n$ sudo ss -tulpn | grep LISTEN\\n$ sudo lsof -i:22 ## see a specific port such as 22 ##\\n$ sudo nmap -sTU -O IP-address-Here\\nInstalling a package on Debian\\nsudo apt install <packagename>\\nListing all package on Debian\\nDpkg -l | grep <packagename>\\nUnInstalling a package on 
Debian\\nSudo apt remove <packagename>\\nSudo apt autoclean  && sudo apt autoremove\\nList all Processes on Debian/Ubuntu\\nPs -aux\\napt-get update && apt-get install procps\\napt-get install iproute2 for ss -tulpn\\n#Postgres Install\\nsudo sh -c \\'echo \"deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main\" > /etc/apt/sources.list.d/pgdg.list\\'\\nwget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -\\nsudo apt-get update\\nsudo apt-get -y install postgresql\\n#Changing Postgresql port to 5432\\n- sudo service postgresql stop - sed -e \\'s/^port.*/port = 5432/\\' /etc/postgresql/10/main/postgresql.conf > postgresql.conf\\n- sudo chown postgres postgresql.conf\\n- sudo mv postgresql.conf /etc/postgresql/10/main\\n- sudo systemctl restart postgresql',\n",
       "   'section': 'Triggers in Mage via CLI',\n",
       "   'question': 'Basic Commands'}]}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents_raw[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8a9efb9",
   "metadata": {},
   "source": [
    "We need to create a collection first. Qdrant will handle the IDF calculations, if we configure it to. That's required for BM25, otherwise it won't boost the rare words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f0c5a4c6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from qdrant_client import models\n",
    "\n",
    "# Create the collection with specified sparse vector parameters\n",
    "client.create_collection(\n",
    "    collection_name=\"zoomcamp-sparse\",\n",
    "    sparse_vectors_config={\n",
    "        \"bm25\": models.SparseVectorParams(\n",
    "            modifier=models.Modifier.IDF,\n",
    "        )\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bd3171a",
   "metadata": {},
   "source": [
    "FastEmbed comes with a BM25 implementation that we can use as any other model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8cb45281",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import uuid\n",
    "\n",
    "# Send the points to the collection\n",
    "client.upsert(\n",
    "    collection_name=\"zoomcamp-sparse\",\n",
    "    points=[\n",
    "        models.PointStruct(\n",
    "            id=uuid.uuid4().hex,\n",
    "            vector={\n",
    "                \"bm25\": models.Document(\n",
    "                    text=doc[\"text\"], \n",
    "                    model=\"Qdrant/bm25\",\n",
    "                ),\n",
    "            },\n",
    "            payload={\n",
    "                \"text\": doc[\"text\"],\n",
    "                \"section\": doc[\"section\"],\n",
    "                \"course\": course[\"course\"],\n",
    "            }\n",
    "        )\n",
    "        for course in documents_raw\n",
    "        for doc in course[\"documents\"]\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "906b4d86",
   "metadata": {},
   "source": [
    "You might be surprised how fast the upload operation was. BM25 does not require a neural network, so it is fast compared to dense embedding models.\n",
    "\n",
    "## Step 3: Running sparse vector search with BM25\n",
    "\n",
    "Right now, our vectors are ready to be searched over. Let's create a helper function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "949bd8f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def search(query: str, limit: int = 1) -> list[models.ScoredPoint]:\n",
    "    results = client.query_points(\n",
    "        collection_name=\"zoomcamp-sparse\",\n",
    "        query=models.Document(\n",
    "            text=query,\n",
    "            model=\"Qdrant/bm25\",\n",
    "        ),\n",
    "        using=\"bm25\",\n",
    "        limit=limit,\n",
    "        with_payload=True,\n",
    "    )\n",
    "\n",
    "    return results.points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3b7ac1d5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results = search(\"Qdrant\")\n",
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1301b61",
   "metadata": {},
   "source": [
    "Sparse vectors can return no results, if none of the keywords from the query were ever used in the documents. No matter if there are some synonyms. Terminology does matter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8347afa5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "You can use round() function or f-strings\n",
      "round(number, 4)  - this will round number up to 4 decimal places\n",
      "print(f'Average mark for the Homework is {avg:.3f}') - using F string\n",
      "Also there is pandas.Series. round idf you need to round values in the whole Series\n",
      "Please check the documentation\n",
      "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.round.html#pandas.Series.round\n",
      "Added by Olga Rudakova\n"
     ]
    }
   ],
   "source": [
    "results = search(\"pandas\")\n",
    "print(results[0].payload[\"text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5097d9c5",
   "metadata": {},
   "source": [
    "Scores returned by BM25 are not calculated with cosine similarity, but with BM25 formula. They are not bounded to a specific range, but are virtually unbounded. Let's see how they may look like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "c252b3f6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6.0392046"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results[0].score"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea7933a1",
   "metadata": {},
   "source": [
    "That's an important observation before we start implementing hybrid search.\n",
    "\n",
    "### Natural language like queries\n",
    "\n",
    "Let's try again with a random question from our dataset to see how well sparse vector search can work with longer, natural language like queries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "2b71aafd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"text\": \"Even though the upload works using aws cli and boto3 in Jupyter notebook.\\nSolution set the AWS_PROFILE environment variable (the default profile is called default)\",\n",
      "  \"section\": \"Module 4: Deployment\",\n",
      "  \"question\": \"Uploading to s3 fails with An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.\\\"\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "import random\n",
    "import json\n",
    "\n",
    "random.seed(202506)\n",
    "\n",
    "course = random.choice(documents_raw)\n",
    "course_piece = random.choice(course[\"documents\"])\n",
    "print(json.dumps(course_piece, indent=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "65cd0cbc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The trial dbt account provides access to dbt API. Job will still be needed to be added manually. Airflow will run the job using a python operator calling the API. You will need to provide api key, job id, etc. (be careful not committing it to Github).\n",
      "Detailed explanation here: https://docs.getdbt.com/blog/dbt-airflow-spiritual-alignment\n",
      "Source code example here: https://github.com/sungchun12/airflow-toolkit/blob/95d40ac76122de337e1b1cdc8eed35ba1c3051ed/dags/examples/dbt_cloud_example.py\n"
     ]
    }
   ],
   "source": [
    "results = search(course_piece[\"question\"])\n",
    "print(results[0].payload[\"text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "445977a9",
   "metadata": {},
   "source": [
    "### Step 4: Qdrant Universal Query API - prefetching\n",
    "\n",
    "Qdrant's `.query_points` method allows building multi-step search pipelines which can incorporate various methods into a single call. For example, we can retrieve some candidates with dense vector search, and then rerank them with sparse search, or use a fast method for initial retrieval and precise, but slow, reranking.\n",
    "\n",
    "```ascii\n",
    "┌─────────────┐           ┌─────────────┐\n",
    "│             │           │             │\n",
    "│  Retrieval  │ ────────► │  Reranking  │\n",
    "│             │           │             │\n",
    "└─────────────┘           └─────────────┘\n",
    "```\n",
    "\n",
    "Let's create another collection that will keep both dense and sparse representations. Qdrant named vectors allow us to store multiple representations per point and it proves useful especially when we want to use mulitple models in our applications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "a153b239",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create the collection with both vector types\n",
    "client.create_collection(\n",
    "    collection_name=\"zoomcamp-sparse-and-dense\",\n",
    "    vectors_config={\n",
    "        # Named dense vector for jinaai/jina-embeddings-v2-small-en\n",
    "        \"jina-small\": models.VectorParams(\n",
    "            size=512,\n",
    "            distance=models.Distance.COSINE,\n",
    "        ),\n",
    "    },\n",
    "    sparse_vectors_config={\n",
    "        \"bm25\": models.SparseVectorParams(\n",
    "            modifier=models.Modifier.IDF,\n",
    "        )\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9dd1ebe4",
   "metadata": {},
   "source": [
    "We have to upload all the vectors into the newly created collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "40bcff2c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "client.upsert(\n",
    "    collection_name=\"zoomcamp-sparse-and-dense\",\n",
    "    points=[\n",
    "        models.PointStruct(\n",
    "            id=uuid.uuid4().hex,\n",
    "            vector={\n",
    "                \"jina-small\": models.Document(\n",
    "                    text=doc[\"text\"],\n",
    "                    model=\"jinaai/jina-embeddings-v2-small-en\",\n",
    "                ),\n",
    "                \"bm25\": models.Document(\n",
    "                    text=doc[\"text\"], \n",
    "                    model=\"Qdrant/bm25\",\n",
    "                ),\n",
    "            },\n",
    "            payload={\n",
    "                \"text\": doc[\"text\"],\n",
    "                \"section\": doc[\"section\"],\n",
    "                \"course\": course[\"course\"],\n",
    "            }\n",
    "        )\n",
    "        for course in documents_raw\n",
    "        for doc in course[\"documents\"]\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "8cf16f3f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def multi_stage_search(query: str, limit: int = 1) -> list[models.ScoredPoint]:\n",
    "    results = client.query_points(\n",
    "        collection_name=\"zoomcamp-sparse-and-dense\",\n",
    "        prefetch=[\n",
    "            models.Prefetch(\n",
    "                query=models.Document(\n",
    "                    text=query,\n",
    "                    model=\"jinaai/jina-embeddings-v2-small-en\",\n",
    "                ),\n",
    "                using=\"jina-small\",\n",
    "                # Prefetch ten times more results, then\n",
    "                # expected to return, so we can really rerank\n",
    "                limit=(10 * limit),\n",
    "            ),\n",
    "        ],\n",
    "        query=models.Document(\n",
    "            text=query,\n",
    "            model=\"Qdrant/bm25\", \n",
    "        ),\n",
    "        using=\"bm25\",\n",
    "        limit=limit,\n",
    "        with_payload=True,\n",
    "    )\n",
    "\n",
    "    return results.points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "79320aff",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"text\": \"Even though the upload works using aws cli and boto3 in Jupyter notebook.\\nSolution set the AWS_PROFILE environment variable (the default profile is called default)\",\n",
      "  \"section\": \"Module 4: Deployment\",\n",
      "  \"question\": \"Uploading to s3 fails with An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.\\\"\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "print(json.dumps(course_piece, indent=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "21b43d68",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Problem description. How can we connect s3 bucket to MLFLOW?\n",
      "Solution: Use boto3 and AWS CLI to store access keys. The access keys are what will be used by boto3 (AWS' Python API tool) to connect with the AWS servers. If there are no Access Keys how can they make sure that they have the right to access this Bucket? Maybe you're a malicious actor (Hacker for ex). The keys must be present for boto3 to talk to the AWS servers and they will provide access to the Bucket if you possess the right permissions. You can always set the Bucket as public so anyone can access it, now you don't need access keys because AWS won't care.\n",
      "Read more here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html\n",
      "Added by Akshit Miglani\n"
     ]
    }
   ],
   "source": [
    "results = multi_stage_search(course_piece[\"question\"])\n",
    "print(results[0].payload[\"text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c0af186",
   "metadata": {},
   "source": [
    "## Step 5: Building Hybrid Search\n",
    "\n",
    "In real production systems, you don't need to choose just one vector type. You never know what kind of queries your users will send to the system. E-commerce search might be just fine with lexical search on top of sparse vectors, as people will tend to send keywords, but in conversational systems, such as chatbots, natural language questions might be more frequent. Using one model as a retriever and another one as reranker is not the only way of how to use dense and sparse in a single system.\n",
    "\n",
    "Hybrid Search is a technique for combining results coming from different search methods - for example dense and sparse. There isn't a clear definition of how exactly to implement it, as the main problem is how to mix results coming from methods which are incompatible. Dense and sparse search scores can't be compared directly, so we need another method that will order the final results somehow.\n",
    "\n",
    "There are two terms important for Hybrid Search: **fusion** and **reranking**.\n",
    "\n",
    "### Fusion\n",
    "\n",
    "Fusion is a set of methods which work on the scores/ranking as returned by the individual methods. There are various ways of how to achieve that, but Reciprocal Rank Fusion is the most popular technique. It is based on the rankings of the documents in each methods used, and these rankings are used to calculate the final scores. You will never calculate these scores, as Qdrant has some built-in capabilities that we will use. However, the following example can give you a rough intuition:\n",
    "\n",
    "| Document | Dense ranking | Sparse ranking | RRF score | Final ranking |\n",
    "| --- | --- | --- | --- | --- |\n",
    "| D1 | **1** | 5 | 0.0318 | 2 |\n",
    "| D2 | 2 | 4 | 0.0317 | 3 |\n",
    "| D3 | 3 | 2 | 0.0320 | **1** |\n",
    "| D4 | 4 | 3 | 0.0315 | 5 |\n",
    "| D5 | 5 | **1** | 0.0318 | 2 |"
   ]
  },
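  {
   "cell_type": "markdown",
   "id": "7d3e5b1a",
   "metadata": {},
   "source": [
    "The scores in the table above can be roughly reproduced with a few lines of plain Python. This is only a sketch for intuition, assuming the commonly used RRF formula in which every result list contributes `1 / (k + rank)` to a document's score, with `k = 60` and 1-based ranks. The helper `rrf_scores` is a made-up name for this illustration, and Qdrant's built-in fusion may use different constants, so the absolute values it returns can differ. Only ranks enter the formula, which is why the incomparable raw scores of the dense and sparse searches don't matter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4fa2c9e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "def rrf_scores(rankings: list[dict[str, int]], k: int = 60) -> dict[str, float]:\n",
    "    # Hypothetical helper: sum 1 / (k + rank) over every ranking\n",
    "    # in which a document appears (k = 60 is an assumed constant)\n",
    "    scores: dict[str, float] = {}\n",
    "    for ranking in rankings:\n",
    "        for doc_id, rank in ranking.items():\n",
    "            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)\n",
    "    return scores\n",
    "\n",
    "\n",
    "# 1-based ranks taken from the table above\n",
    "dense_ranks = {\"D1\": 1, \"D2\": 2, \"D3\": 3, \"D4\": 4, \"D5\": 5}\n",
    "sparse_ranks = {\"D1\": 5, \"D2\": 4, \"D3\": 2, \"D4\": 3, \"D5\": 1}\n",
    "\n",
    "fused = rrf_scores([dense_ranks, sparse_ranks])\n",
    "\n",
    "# Print the documents from the highest to the lowest fused score\n",
    "for doc_id, score in sorted(fused.items(), key=lambda item: item[1], reverse=True):\n",
    "    print(f\"{doc_id}: {score:.5f}\")"
   ]
  },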
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "c7de1c26",
   "metadata": {},
   "outputs": [],
   "source": [
    "def rrf_search(query: str, limit: int = 1) -> list[models.ScoredPoint]:\n",
    "    results = client.query_points(\n",
    "        collection_name=\"zoomcamp-sparse-and-dense\",\n",
    "        prefetch=[\n",
    "            models.Prefetch(\n",
    "                query=models.Document(\n",
    "                    text=query,\n",
    "                    model=\"jinaai/jina-embeddings-v2-small-en\",\n",
    "                ),\n",
    "                using=\"jina-small\",\n",
    "                limit=(5 * limit),\n",
    "            ),\n",
    "            models.Prefetch(\n",
    "                query=models.Document(\n",
    "                    text=query,\n",
    "                    model=\"Qdrant/bm25\",\n",
    "                ),\n",
    "                using=\"bm25\",\n",
    "                limit=(5 * limit),\n",
    "            ),\n",
    "        ],\n",
    "        # Fusion query enables fusion on the prefetched results\n",
    "        query=models.FusionQuery(fusion=models.Fusion.RRF),\n",
    "        with_payload=True,\n",
    "    )\n",
    "\n",
    "    return results.points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "5e82ccd7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"text\": \"Even though the upload works using aws cli and boto3 in Jupyter notebook.\\nSolution set the AWS_PROFILE environment variable (the default profile is called default)\",\n",
      "  \"section\": \"Module 4: Deployment\",\n",
      "  \"question\": \"Uploading to s3 fails with An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.\\\"\"\n",
      "}\n",
      "When executing an AWS CLI command (e.g., aws s3 ls), you can get the error <botocore.awsrequest.AWSRequest object at 0x7fbaf2666280>.\n",
      "To fix it, simply set the AWS CLI environment variables:\n",
      "export AWS_DEFAULT_REGION=eu-west-1\n",
      "export AWS_ACCESS_KEY_ID=foobar\n",
      "export AWS_SECRET_ACCESS_KEY=foobar\n",
      "Their value is not important; anything would be ok.\n",
      "Added by Giovanni Pecoraro\n"
     ]
    }
   ],
   "source": [
    "results = rrf_search(course_piece[\"question\"])\n",
    "print(json.dumps(course_piece, indent=2))\n",
    "print(results[0].payload[\"text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f645833e",
   "metadata": {},
   "source": [
    "### Reranking\n",
    "\n",
    "Reranking is a broader term related to Hybrid Search. Fusion is one of the ways to rerank the results of multiple methods, but you can also apply a slower method that won't be effective enough to search over all the documents. But there is more to it. Business rules are often important for retrieval, as you prefer to show documents coming from the most recent news, for instance.\n",
    "\n",
    "## Next steps\n",
    "\n",
    "Dense and sparse vector search methods might not be enough in some cases, but both are fast enough to be used as initial retrievers. Plenty of more accurate yet slower methods exist, such as cross-encoders or [multivector representations](https://qdrant.tech/documentation/advanced-tutorials/using-multivector-representations/). These topics are definitely more advanced, and we won't cover them right now. However, it's good to mention them so you are aware they exist."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
